File Synchronization with Unison

Status: Finished except for the "Advanced Topics" section.

Last changed: Saturday 2015-01-10 18:32 UTC

Abstract:

This page is about file synchronization and Unison in particular. Unison is a piece of software that allows one to keep his files synchronized and backed up across different computers. This page describes the advantages of keeping files synchronized among several computers and provides a tutorial for setting up Unison.

Table of Contents

Theoretical Part

Peace of Mind
The Benefits of Unison
Who Should Use Unison
Security Concerns
Invariants
Remote Usage
How to Synchronize
Preferences respectively Switches
What shall I synchronize?
Topology

Practical Part

Installation
Preparatory Work
Configuring
Using Unison

Advanced Topics

Automating the Process

Theoretical Part

This section lists many different things about and with regard to Unison. It is a loose collection of things that people should know before they move on to actually installing and setting up Unison for their everyday usage.

Peace of Mind

Unison provides us with two major benefits at once:

Data backup and
Data synchronization

Unison, a free cross-platform file synchronization program, can not only provide us with multiple backups of our files, but more importantly, grant us the freedom to simultaneously use different computers with access to all of our files, thus liberating us from the confines of one particular machine.

Unison allows us to access the same set of files from any computer (running Mac OS X, Windows XP, or UNIX/Linux variants) and keeps these files up to date by propagating the latest changes from one replica to the other during synchronization. I personally use Unison to keep replicas of all of my personal files across two different computers: my workstation and my subnotebook. I also use another computer (a server) in the process (more on that later).

Because my documents and configuration files are accessible from every computer, I am free to use whichever one is the most convenient at the moment without the hassle of transferring files using floppy disks, USB drives, or email. Lean back, relax and think about how much time you waste on those tasks every day. Me? Well, three or so seconds per day...

For example, if someone is in the office on a Linux machine and wants to work on a paper for a class, he can just open up the file and start typing. Before he leaves, he simply synchronizes that directory to his server located somewhere on the Internet. Then when he gets home, he runs Unison again to synchronize on his Mac and continues working on his paper.

If he feels like watching TV later while continuing to work, he can simply switch to using his Windows laptop. Then he can finish up final edits at the office the next morning on his Linux box. By the time he is done, not only has he edited the same paper on three different computers without the hassle of emailing copies to himself, but he also has three identical copies of it, so that if any one of his computers blows up, he can still turn in his paper on time.

Unison has allowed him to have the peace of mind that comes with having his files seamlessly backed up while he is working on them and also the freedom of being able to do his work wherever and whenever is convenient.

This section describes some of the benefits of using Unison and provides some tips on doing so.

The Benefits of Unison

This subsection lists some of the most obvious benefits that come with using Unison.

Liberation from a particular Computer or Operating System

This is perhaps the most practical and visible advantage of using Unison in our daily computing life. If we can have access to our files from any machine that we use (and assuming that we have programs on each machine that can utilize these files), then it really does not matter which one we use.

Furthermore, if we can put custom configuration files for our shell or applications in our Unison hierarchy and simply use symlinks to refer to these copies on every computer, then we can have a uniform working environment. For example, one might use the Bash shell on every computer, whether it be Windows XP with Cygwin, Mac OS X, FreeBSD, or whatever Linux he has in front of him at the moment. He might then have a common Bash configuration file shared by all machines, and files particular to each machine. His command prompt looks the same on all machines, and he can use all of the same aliases and shell functions. When he finds a cool Bash function when browsing the Internet at work, he can simply add it to his common Bash config file, sync it, and when he gets home at night, he can access that same function on his home machine.

This freedom allows us to transcend the incessant bickering over which operating system is better — we can use whatever OS has the programs we want for some particular application, or simply whatever OS is in front of us at the moment.

Live backups via file replication

Our personal data (documents, photographs, emails, etc.) is the most valuable component in our interaction with computers, because it can be irreplaceable if lost.

Data backup is something that everybody should do, but unfortunately, few people do it on a regular basis. In contrast to traditional backup methods, the great benefit of using Unison to replicate our files across different computers is that our backups are alive. They are not sitting on some archive tape in the basement; they are on the hard drives of each and every computer we use.

Seamless control and Verification of Backups

By synchronizing our Unison file replicas, we are the ones who control our backups, so we can be confident that they are being performed correctly. We verify the integrity of our backups simply by switching computers and accessing the files during our normal course of work.

What often happens is that we think that our organization is properly backing up our files, when in fact it is not. We never consider backing up our own files because we know that our company takes care of that (better double check that!). If we lose a file, we do not sweat because we know that the sysadmins have a backup, but to our surprise, their backup was not done properly... that is when backup trauma strikes. With Unison, though, we control our own backups, and the more replicas we have, the less likely it is that we will lose our precious data.

Fast and non-traumatic Recovery from Hardware Failures

A hard drive crash or total computer meltdown is traumatic for most people. Why? Not because they need to pay a few hundred dollars for new hardware, but because they have just lost most or all of their precious data.

If they are somewhat diligent about backups, they probably have some old backup CDs, dating back a few months, but that is still a few months of lost work. With Unison, we back up basically as often as we use our computer, so we will at worst lose only the data that we have been working on for the past few hours. If one of our machines dies, then it is annoying to pay for new hardware and install our OS and software again (which is trivial if we have an OS with automated package management software such as Fink, RPM (RPM Package Manager), or APT (Advanced Packaging Tool), or even better something like FAI (Fully Automatic Installation)), but it is non-traumatic because we have not lost any data.

If we replicated the configuration files for our favorite applications, then restoring their pre-crash state is as easy as re-installing and moving those files back to the correct places. Unison allows data to transcend hardware — after all, hardware is cheap and plentiful, but our data is irreplaceable and worth a lot.

Who Should Use Unison

I am not going to preach that everybody in the world should use Unison. I think that everybody should back up their data regularly, but Unison is overkill for simply backing up data. However, those who use more than one computer on a regular basis can probably benefit from Unison. Here are some typical configurations for different types of users:

Casual Home User with no access to a Server

The typical home user who has a laptop and desktop computer but no access to a file server probably uses a removable USB stick or hard drive to shuttle files back and forth between his computers.

With Unison, he can still use that method of transferring data, except that he can be confident that all of his computers will always have up-to-date copies of files (as long as he remembers to synchronize, i.e. invoke unison).

For example, he can do some work on his laptop, synchronize with the removable drive, move the drive to the desktop computer, and synchronize again before he starts working. Both computers (as well as the removable drive) then contain the most recent versions of all files, regardless of which computer he used to edit them.

University Student / On-line Storage Owner

A student at a modern university probably has a certain amount of storage space on the university servers as well as SSH (Secure Shell) remote login access, which is enough to run Unison.

He should definitely take advantage of this space because it is probably well maintained, regularly backed up, and running on high-end hardware... maybe even a SAN (Storage Area Network).

After all, our tuition is helping to pay the salaries of people who are in charge of protecting our data. We can synchronize our various machines against the school's servers and therefore have a very well secured storage for a relatively low price.

Server Administrator

The ideal way to run Unison is if one can set up his own personal server with SSH login capabilities (this is possible with any flavor of UNIX or Linux, Mac OS X, and Windows XP with Cygwin).

My suggestion is to dedicate one computer as our Unison server (especially easy using virtualization, e.g. OpenVZ) which holds all of our relevant data, and to synchronize all of our other computers (workstation, subnotebook, etc.) to that server. I use that setup, which is also known as a star topology.

Security Concerns

When I first tell people about the benefits of keeping multiple replicas of their personal data on different machines, preferably at different physical locations, one recurring concern is security.

If I place my data on the university server, will other people have access to it?
If I start my own server and run Unison via SSH (Secure Shell), can anybody on the Internet connect to it and see my data?
etc.

It is true that, the more places our data resides, the more vulnerable it is to third-party snoopers. However, if we are careful with choosing a strong passphrase for user accounts, if we use secure tools like SSH (preferably in a PKA (Public Key Authentication) setup), if we use block-layer encryption, and, if we store our data on reliable/trusted servers only, then everything is fine.

In short, high security is possible, but it takes a person with skills and a lot of time to get there and, even more important, to stay there. Security is not a one-time shot but a steady process!

Well, since I am a bit paranoid, meticulous and... well, I know a trick or two ;-]... my bottom line is, I use encrypted connections in between my data sinks and sources, block-layer encryption, IDS (Intrusion Detection System), firewall, honey pot and a bunch of other cunning things to stay safe.

Invariants

Given the importance and delicacy of the job that Unison performs, it is important to understand both what a synchronizer does under normal conditions and what can happen under unusual conditions such as system crashes and communication failures.

Unison is careful to protect both its internal state and the state of the replicas at every point in this process. Specifically, the following guarantees are enforced:

At every moment, each path in each replica has either (1) its original contents (i.e., no change at all has been made to this path), or (2) its correct final contents (i.e., the value that the user expected to be propagated from the other replica).
At every moment, the information stored on disk about Unison's private state can be either (1) unchanged, or (2) updated to reflect those paths that have been successfully synchronized.

The upshot is that it is safe to interrupt Unison at any time, either manually or accidentally.

Caveat: the above is almost true; there are occasionally brief periods where it is not (and, because of shortcomings of the POSIX filesystem API, cannot be). In particular, when Unison is copying a file onto a directory or vice versa, it must first move the original contents out of the way. If Unison gets interrupted during one of these periods, some manual cleanup may be required. In this case, a file called DANGER.README will be left in our home directory, containing information about the operation that was interrupted. The next time we run Unison, it will notice this file and warn us about it.

If an interruption happens while it is propagating updates, then there may be some paths for which an update has been propagated but which have not been marked as synchronized in Unison's archives. This is no problem since the next time Unison runs, it will detect changes to these paths in both replicas, notice that the contents are now equal, and mark the paths as successfully updated when it writes back its private state at the end of this run.

If Unison is interrupted, it may sometimes leave temporary working files (with suffix .tmp) in the replicas. It is safe to delete these files. Also, if the backups flag is set, Unison will leave around old versions of files that it overwrites, with names like file.0.unison.bak. These can be deleted safely when they are no longer wanted.

Unison is not bothered by clock skew between the different hosts on which it is running. It only performs comparisons between timestamps obtained from the same host, and the only assumption it makes about them is that the clock on each system always runs forward.

If Unison finds that its archive files have been deleted (or that the archive format has changed and they cannot be read, or that they do not exist because this is the first run of Unison on these particular roots), it takes a conservative approach i.e. it behaves as though the replicas had both been completely empty at the point of the last synchronization. The effect of this is that, on the first run, files that exist in only one replica will be propagated to the other, while files that exist in both replicas but are unequal will be marked as conflicting.
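To make the first-run rule concrete, here is a minimal sketch in Python. It is a toy model under stated assumptions, not Unison's actual code: each replica is a plain dictionary mapping a path to its contents, and first_run_plan is a hypothetical helper name made up for this illustration.

```python
def first_run_plan(replica_a, replica_b):
    """Toy model of a run without archives (hypothetical helper, not Unison code).

    Each replica is a dict mapping path -> contents. Without archives there is
    no history, so a path is judged only by where it exists and whether the
    two copies are equal.
    """
    plan = {}
    for path in sorted(set(replica_a) | set(replica_b)):
        in_a = path in replica_a
        in_b = path in replica_b
        if in_a and not in_b:
            plan[path] = "copy A -> B"      # exists only in A: propagate
        elif in_b and not in_a:
            plan[path] = "copy B -> A"      # exists only in B: propagate
        elif replica_a[path] == replica_b[path]:
            plan[path] = "equal"            # same contents: nothing to do
        else:
            plan[path] = "conflict"         # both exist but differ
    return plan


a = {"notes.txt": "v1", "only_a.txt": "x", "paper.tex": "draft 1"}
b = {"notes.txt": "v1", "only_b.txt": "y", "paper.tex": "draft 2"}
for path, action in first_run_plan(a, b).items():
    print(path, "->", action)
```

The point of the sketch is the last branch: with no record of the previous state, two differing copies cannot be told apart as "old" and "new", so the only safe answer is a conflict.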

Touching a file without changing its contents should never affect whether or not Unison does an update.

When running with the fastcheck preference set to true (the default on Unix systems), Unison uses file modtimes for a quick first pass to tell which files have definitely not changed. Then, for each file that might have changed, it computes a fingerprint (also known as a checksum or hash) of the file's contents and compares it against the last-synchronized contents. Also, the -times option allows us to synchronize file times, but it does not cause the contents of identical files to be changed, i.e. Unison will only modify the file times.
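The two-pass idea can be sketched in a few lines of Python. This is only an illustration of the technique, not Unison's implementation: the Record type and the function names are made up here, and SHA-256 is simply a convenient hash from the standard library, chosen as an assumption.

```python
import hashlib
import os
from collections import namedtuple

# What we remember about a file from the last synchronization (hypothetical).
Record = namedtuple("Record", "mtime_ns size digest")


def fingerprint(path):
    """Hash the file's contents in chunks (SHA-256 is an assumption)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(path):
    """Record modtime, length and fingerprint of a file."""
    st = os.stat(path)
    return Record(st.st_mtime_ns, st.st_size, fingerprint(path))


def has_changed(path, last):
    """Quick pass on modtime and length, fingerprint only if that fails."""
    st = os.stat(path)
    if (st.st_mtime_ns, st.st_size) == (last.mtime_ns, last.size):
        return False  # quick pass: treated as definitely unchanged
    # The file might have changed: compare contents via the fingerprint.
    return fingerprint(path) != last.digest
```

Note how a file that was merely touched fails the quick pass but still compares equal by fingerprint, which matches the statement above that touching a file without changing its contents does not lead to an update.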

It is safe to brainwash Unison by deleting its archive files on both replicas. The next time it runs, it will assume that all the files it sees in the replicas are new.

It is safe to modify files while Unison is working. If Unison discovers that it has propagated an out-of-date change, or that the file it is updating has changed on the target replica, it will signal a failure for that file. In such case, running Unison again will propagate the latest changes.

Changes to the ignore patterns from the user interface (e.g., using the i key) are immediately reflected in the current profile.

Remote Usage

There are two basic choices to synchronize data with some remote machine:

SSH (Secure Shell) or
socket connection

SSH is the standard and preferred method. Most folks, including me, have never tried the socket connection method.

How to Synchronize

There are four possible choices:

1. Synchronize our whole home directory, using the ignore facility to avoid synchronizing temporary files and things that only belong on one host.
2. Create a subdirectory called e.g. ~/shared in our home directory on each host, and put all the files we want to synchronize into this directory.
3. Create a subdirectory e.g. ~/shared in our home directory on each host, and put links to all the files we want to synchronize into this directory. Use the follow preference to make Unison treat these links as transparent.
4. Make our home directory i.e. ~/ or the file system root i.e. / the root of the synchronization, but tell Unison to synchronize only some of the files and subdirectories within it on any given run. This can be accomplished by using the path facility.

I recommend using #4. I chose / to be the root for synchronization and then used the path facility to selectively pick all the files and directories I wanted to have synchronized. This gives me the greatest flexibility and leaves enough room for changes and adaptations whenever I need them.

What I find even more important: approach #4 keeps things small and simple, i.e. easy to maintain even after months or years of usage. Try this with symlinks, i.e. #3... I have been there, it simply does not scale. What exactly my config for #4 looks like can be seen in my config file further down.

Preferences respectively Switches

The Unison manual lists all possible switches. This subsection contains a subset of facilities/switches that I find worth mentioning explicitly. I use pretty much all the facilities listed here.

auto: When set to true, this flag causes the user interface to skip asking for confirmations on non-conflicting changes. (More precisely, when the user interface is done setting the propagation direction for one entry and is about to move to the next, it will skip over all non-conflicting entries and go directly to the next conflict.)

batch: When this is set to true, the user interface will ask no questions at all. Non-conflicting changes will be propagated; conflicts will be skipped.
confirmbigdeletes: When this is set to true, Unison will request an extra confirmation if it appears that the entire replica has been deleted, before propagating the change. If the batch flag is also set, synchronization will be aborted. When the path preference is used, the same confirmation will be requested for top-level paths. (At the moment, this flag only affects the text user interface.) See also the mountpoint preference.
fastcheck xxx: When this preference is set to true, Unison will use the modification time and length of a file as a pseudo inode number when scanning replicas for updates, instead of reading the full contents of every file. Under Windows, this may cause Unison to miss propagating an update if the modification time and length of the file are both unchanged by the update. However, Unison will never overwrite such an update with a change from the other replica, since it always does a safe check for updates just before propagating a change. Thus, it is reasonable to use this switch under Windows most of the time and occasionally run Unison once with fastcheck set to false, if you are worried that Unison may have overlooked an update. The default value of the preference is auto, which causes Unison to use fast checking on Unix replicas (where it is safe) and slow checking on Windows replicas. For backward compatibility, yes, no, and default can be used in place of true, false, and auto. See the Fast Checking section for more information.
follow xxx: Including the preference -follow pathspec causes Unison to treat symbolic links matching pathspec as invisible and behave as if the object pointed to by the link had appeared literally at this position in the replica. See the Symbolic Links section for more details. The syntax of pathspec is described in the Path Specification section.
force xxx: Including the preference -force root causes Unison to resolve all differences (even non-conflicting changes) in favor of root. This effectively changes Unison from a synchronizer into a mirroring utility, which I think is totally cool! Before I started using Unison, I used rsync a lot, but I have since dropped it, and the tasks I did with rsync before are now carried out with Unison and its force feature. The benefit is one tool less to care about, and the remaining one is way mightier anyway. You can also specify -force newer (or -force older) to force Unison to choose the file with the later (earlier) modtime. In this case, the -times preference must also be enabled. This preference is overridden by the forcepartial preference.
forcepartial xxx: Including the preference forcepartial PATHSPEC -> root causes Unison to resolve all differences (even non-conflicting changes) in favor of root for the files in PATHSPEC (see the Path Specification section for more information). This effectively changes Unison from a synchronizer into a mirroring utility. You can also specify forcepartial PATHSPEC -> newer (or forcepartial PATHSPEC -> older) to force Unison to choose the file with the later (earlier) modtime. In this case, the -times preference must also be enabled.
times: When this flag is set to true, file modification times (but not directory modtimes) are propagated.
group: When this flag is set to true, the group attributes of the files are synchronized. Whether the group names or the group identifiers are synchronized depends on the preference numerids.
owner: When this flag is set to true, the owner attributes of the files are synchronized. Whether the owner names or the owner identifiers are synchronized depends on the preference numerids.
perms n: The integer value of this preference is a mask indicating which permission bits should be synchronized. It is set by default to 0o1777: all bits but the set-uid and set-gid bits are synchronized (synchronizing these latter bits can be a security hazard). If you want to synchronize all bits, you can set the value of this preference to -1.
ignore xxx: Including the preference -ignore pathspec causes Unison to completely ignore paths that match pathspec (as well as their children). This is useful for avoiding synchronizing temporary files, object files, etc. The syntax of pathspec is described in the Path Specification section, and further details on ignoring paths are found in the Ignoring Paths section.
immutable xxx: This preference specifies paths for directories whose children are all immutable files, i.e., once a file has been created, its contents never change. When scanning for updates, Unison does not check whether these files have been modified; this can speed update detection significantly (in particular, for mail directories).
log: When this flag is set, Unison will log all changes to the filesystems to a file.
sortbysize: When this flag is set, the user interface will list changed files by size (smallest first) rather than by name. This is useful, for example, for synchronizing over slow links, since it puts very large files at the end of the list where they will not prevent smaller files from being transferred quickly. This preference (as well as the other sorting flags, but not the sorting preferences that require patterns as arguments) can be set interactively and temporarily using the Sort menu in the graphical user interface.
sortnewfirst: When this flag is set, the user interface will list newly created files before all others. This is useful, for example, for checking that newly created files are not junk, i.e., ones that should be ignored or deleted rather than synchronized.
maxthreads n: This preference controls how much concurrency is allowed during the transport phase. Normally, it should be set reasonably high (default is 20) to maximize performance, but when Unison is used over a low-bandwidth link it may be helpful to set it lower (e.g. to 1) so that Unison doesn't soak up all the available bandwidth.
rshargs xxx: The string value of this preference will be passed as additional arguments (besides the hostname and the name of the Unison executable on the remote system) to the rsh command used to invoke the remote server.
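Of the preferences above, perms is the one that is easiest to misread, so here is a small worked example in Python of what a mask of 0o1777 does to a file mode. It only demonstrates the bit arithmetic; synced_bits is a hypothetical helper name and not part of Unison.

```python
import stat

DEFAULT_PERMS_MASK = 0o1777  # default value of perms quoted in the list above


def synced_bits(mode, mask=DEFAULT_PERMS_MASK):
    """Return the permission bits of `mode` that the mask lets through."""
    return stat.S_IMODE(mode) & mask


setuid_binary = 0o4755   # rwxr-xr-x plus the set-uid bit
print(oct(synced_bits(setuid_binary)))       # 0o755: set-uid bit is dropped
print(oct(synced_bits(setuid_binary, -1)))   # 0o4755: -1 keeps every bit
```

In other words, the default mask keeps the usual read/write/execute bits and the sticky bit, and filters out set-uid (0o4000) and set-gid (0o2000).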

What shall I synchronize?

Well, that is entirely up to each of us. Here is what I do:

,----[ head -n40 ~/.unison/common ]
| ## Paths (directories resp. files) to synchronize
| #directories
| path = home/sa/Desktop
| path = home/sa/Mail
| path = home/sa/News
| path = home/sa/misc
| path = home/sa/work/git
| path = home/sa/em
| path = home/sa/mm
| path = usr/local/sbin
| path = home/sa/.purple
| path = home/sa/.mozilla
| path = home/sa/.workrave
| path = home/sa/.sec
| path = home/sa/.local
| path = home/sa/.emacs.d
|
|
| #files
| path = home/sa/.bashrc
| path = home/sa/.bash_history
| path = home/sa/.bash_profile
| path = home/sa/.dingrc
| path = home/sa/.emacs
| path = home/sa/.emacs.elc
| path = home/sa/.emacs.desktop
| path = home/sa/.emacs.desktop.lock
| path = home/sa/.dired
| path = home/sa/.adobe
| path = home/sa/.unison/common
| path = usr/share/games/fortunes/mjg
| path = usr/share/games/fortunes/mjg.dat
| path = usr/share/games/fortunes/mjg.u8
|
|
| ## Data not to be synchronized
| ignore = Path home/sa/mm/audio/music
| ignore = Path home/sa/.adobe/Acrobat/8.0/Synchronizer
|
`----

As can be seen, I synchronize a bunch of directories (recursively) and a bunch of files. Last but not least, I also use the ignore facility in order to ignore some paths. As I mentioned earlier, I use approach #4 from above.
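For those who prefer to generate such a file rather than type it, the following Python sketch renders a profile in the same key = value style as the listing above. It is a minimal sketch under assumptions: render_profile is a hypothetical helper, and the two roots used in the example (/ and /mnt/usb-hdd) are made up for illustration, since the listing above does not show any roots.

```python
def render_profile(roots, paths, ignored_paths):
    """Render roots, paths and ignore rules as profile lines (toy helper)."""
    lines = [f"root = {r}" for r in roots]
    # Paths are given relative to the root, as in the listing above.
    lines += [f"path = {p}" for p in paths]
    lines += [f"ignore = Path {p}" for p in ignored_paths]
    return "\n".join(lines) + "\n"


profile = render_profile(
    roots=["/", "/mnt/usb-hdd"],                 # assumed roots, not from the listing
    paths=["home/sa/Mail", "home/sa/.bashrc"],
    ignored_paths=["home/sa/mm/audio/music"],
)
print(profile, end="")
```

The helper does nothing clever; it merely shows that approach #4 boils down to two roots plus a list of path and ignore lines.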

Topology

This is much the same as what we already discussed before. The ways in which Unison can be used, and therefore how the synchronization process happens, are best described via well-known network topologies. Unison can do pretty much all of them. However, the majority of people use just two: line and star.

Line

Let us assume one has n computers and therefore also n replicas of his data that need to be synchronized. For simplicity, let us just say n is 3 for this example, namely A, B and C. What type they are, i.e. workstation, laptop etc., does not matter here. Likewise, it does not matter what particular OS (Operating System) A, B or C runs at the time of syncing data among them. The only thing of interest now is how it is done. Here it is:

1  A -> B -> C or
2  B -> A -> C or
3  B <- C <- A
4  etc.

In words, in line 1, we make changes to a file called our_file on machine A. Without syncing, B and C have the same version of our_file but not A. A has the current version of our_file. Then we sync machine A with B. After we did so, A and B have the same version of our_file but not C. C still has the old version of our_file. Finally we sync B with C. Now all machines have the same version of our_file.

One can think about various other combinations about what can be done with a line topology and what not. Bottom line is, if we want to have the same version of our_file on all our computers, once we start syncing, we have to sync through the whole line until we arrive at the last computer. For n computers, that would be n-1 syncs! Clearly, a line topology does not scale and this is where the star topology enters the room.
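The n-1 count can be checked with a tiny simulation in Python. This is a toy model under assumptions: each replica holds just a version number for our_file, a sync simply leaves both sides with the higher number, and the function names are hypothetical.

```python
def sync(replicas, x, y):
    """Leave both replicas with the newer of their two versions (toy model)."""
    newest = max(replicas[x], replicas[y])
    replicas[x] = replicas[y] = newest


def sync_along_line(replicas, order):
    """Sync each neighbouring pair in `order`; return the number of syncs."""
    count = 0
    for left, right in zip(order, order[1:]):
        sync(replicas, left, right)
        count += 1
    return count


replicas = {"A": 2, "B": 1, "C": 1}   # A holds the edited version of our_file
syncs = sync_along_line(replicas, ["A", "B", "C"])
print(syncs, replicas)   # 2 {'A': 2, 'B': 2, 'C': 2}
```

With three machines, two syncs are needed, and each additional machine in the line adds one more. The sketch also assumes that we start at the machine holding the change; starting elsewhere would require walking the line more than once.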

Star

Using Unison in a star topology requires at least three replicas (the reader should not confuse replica with computer). That is, for the minimum of three replicas the following would work

Two computers and one external USB HDD (Hard Disk Drive). Each computer as well as the USB HDD contain one replica. The USB HDD becomes the center node within our star (passive node) i.e. it does not actively make changes to its replica... it just receives changes and distributes them to our two active nodes. Of course, there can be many more active nodes than just two. What remains a constant with the star topology is that there can only be one center i.e. only one passive node.
Two computers and a third one, acting as a server. Therefore, the server becomes the center of the star and therefore the passive node — it just receives and distributes changes. The two active nodes i.e. our two computers always synchronize with the center i.e. the passive node.

I love using Unison with a star topology. I have 3 computers at home that I use actively: one server (passive node; center node within the star), a workstation and a subnotebook (both active nodes). Therefore, I never synchronize between the workstation and the subnotebook; with the star topology, no two active nodes are allowed to synchronize with each other directly. Synchronization can only happen through the passive node (the center of the star)!
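The one rule of the star, that every synchronization goes through the center, can be written down as a short Python sketch. Again this is a toy model: replicas hold only a version number, and the node names server, wks and sub are assumptions chosen to match my setup.

```python
HUB = "server"  # the passive node in the center (name is an assumption)


def sync(replicas, a, b, hub=HUB):
    """Sync two replicas, refusing any pair that bypasses the hub (toy model)."""
    if hub not in (a, b):
        raise ValueError(f"{a} and {b} are both active nodes; sync via {hub}")
    newest = max(replicas[a], replicas[b])
    replicas[a] = replicas[b] = newest


replicas = {"server": 1, "wks": 1, "sub": 1}   # version numbers of one file
replicas["sub"] = 2                 # edit made on the subnotebook
sync(replicas, "sub", "server")     # push the change to the center
sync(replicas, "wks", "server")     # the workstation catches up later
print(replicas)   # {'server': 2, 'wks': 2, 'sub': 2}
```

Calling sync(replicas, "wks", "sub") raises a ValueError in this sketch, which is the programmatic version of never synchronizing two active nodes with each other.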

Where the full power of using Unison with a star topology becomes obvious is if we are on the go a lot and the passive node is accessible via a secure connection (e.g. SSH) over the Internet from any point in the world, at any time.

This is exactly what I do. My server, acting as the passive node (center of the star), is located within a datacenter. Now, no matter where I am on this planet, as long as I have connectivity to the Internet, I can not only synchronize my data with the passive node but also get a backup of my precious data in one go.

Back at home from a trip to Africa or wherever, I synchronize my workstation with my passive node located far away in a highly secured datacenter, and after some seconds or minutes, all machines/replicas (subnotebook, workstation and server, i.e. active nodes and passive node) have the exact same up-to-date versions of all my data.

Actually, my setup is a bit more complex, i.e. I have one server located in the datacenter and one at home. The two servers synchronize their replicas using a simple line topology. This is triggered automatically and requires no human interaction whatsoever. I use inotify, cron and incron to trigger the synchronization with Unison.

Depending on where I am with each of my active nodes (wks or sub), they either synchronize with my server at home or with the one located in the datacenter.

Practical Part

This part takes into account all the aforementioned aspects of living a life with Unison. After reading this part, everyone should be able to install, set up, configure and fine-tune Unison to fit his needs. However, it is not meant to be a comprehensive guide; it is merely a supplement to the official Unison manual.

Installation

One needs to install one or two packages:

1  sa@wks:~$ type dpl
2  dpl is aliased to `dpkg -l'
3  sa@wks:~$ dpl unison* | grep ^ii
4  ii  unison         2.27.57-1+b1   A file-synchronization tool for Unix and Win
5  ii  unison-gtk     2.27.57-1+b1   A file-synchronization tool for Unix and Win
6  sa@wks:~$

As can be seen in line 2, dpl is yet another alias in my .bashrc. What we actually need to install is the package from line 4. I have already installed it, as can be seen. For those who have not, apt-get install unison or aptitude install unison does the trick.

Line 5 is a nice-to-have, but one should never need it if he is comfortable with the CLI (Command Line Interface). Personally, I am now going to remove unison-gtk from my systems since I am a CLI fanboy rather than a GUI (Graphical User Interface) kind of person. However, on some of my other computers I keep both around and use

sa@wks:~$ su
Password:
wks:/home/sa# update-alternatives --config unison

There are 2 alternatives which provide `unison'.

  Selection    Alternative
-----------------------------------------------
 +        1    /usr/bin/unison-latest-stable
*         2    /usr/bin/unison-latest-stable-gtk

Press enter to keep the default[*], or type selection number: 1
Using '/usr/bin/unison-latest-stable' to provide 'unison'.
wks:/home/sa# exit
exit
sa@wks:~$

to switch between the two. In the example above I decided to go with the non-GTK version, as can be seen.

Preparatory Work

Now that we have installed Unison, there are a few things that I recommend should be done before one hits the road to unlimited file synchronization.

Organizing our Files and overall File System Structure

We need to organize all of the files we want to synchronize in our replicas. Before we run Unison for the first time on our data, it is important that all of our files and folders are named and organized the way that we want them to be.

This is because Unison does not know when things are renamed. If, for example, memo.txt is renamed to memo-pad.txt, then Unison thinks that the file memo.txt has been deleted and a new file memo-pad.txt has been created. Of course, we can rename files and directories all we want, but Unison will simply think that we deleted the old ones and created identical new ones, which gets annoying especially when we rename a subdirectory containing tens of gigabytes of data. Unison would re-transfer the whole shebang from one replica to the other just because we renamed ../my_mp3s to ../music, i.e. issued mv my_mp3s music.
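To see why a rename is so expensive, here is a minimal sketch that compares two path listings the way a purely path-based comparison would. It assumes nothing about Unison's internals; diff_listings and the throwaway demo directory are made up for illustration.

```shell
# Hypothetical helper: compare two sorted path listings. Paths only in
# the first listing are reported as deleted, paths only in the second
# listing as created.
diff_listings() {
    before=$1
    after=$2
    LC_ALL=C comm -23 "$before" "$after" | sed 's/^/deleted: /'
    LC_ALL=C comm -13 "$before" "$after" | sed 's/^/created: /'
}

# Demo in a throwaway directory: rename my_mp3s to music.
demo=$(mktemp -d)
mkdir "$demo/my_mp3s"
touch "$demo/my_mp3s/a.mp3" "$demo/my_mp3s/b.mp3"
(cd "$demo" && find . -type f | LC_ALL=C sort) > "$demo.before"
mv "$demo/my_mp3s" "$demo/music"
(cd "$demo" && find . -type f | LC_ALL=C sort) > "$demo.after"

# Every file shows up twice: once as deleted, once as created.
rename_report=$(diff_listings "$demo.before" "$demo.after")
printf '%s\n' "$rename_report"
rm -rf "$demo" "$demo.before" "$demo.after"
```

Although no file content changed, the listing reports two deletions and two creations, which is exactly the situation we want to avoid for directories holding gigabytes of data.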

I suggest that all of our files be organized in sub-directories under one main directory, which will be the root directory for our synchronization. Again, take a look at approach #4 from above.

The bottom line is that, before we run unison for the first time, all data we plan to synchronize should be the same in all replicas. However, this is just a recommendation and not mandatory, since Unison can do the initial transfer itself; it would just take longer. Also, by cleaning up and reorganizing his data, one might actually get rid of some dust that has settled over the years. Remove crap, rename awkward stuff, consolidate stuff, remove duplicates (fdupes, fslint, etc.). Clean up your file system(s), ladies and gents! ;-]
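Whether two replicas really are the same before the first run can be checked with standard tools. The following is only a sketch: replicas_identical is a made-up helper, and diff compares file names and contents, not permissions or timestamps.

```shell
# Hypothetical helper: succeed only if both directory trees contain the
# same files with the same contents. diff -r recurses into
# subdirectories, -q only reports whether files differ.
replicas_identical() {
    diff -rq "$1" "$2" > /dev/null 2>&1
}

# Possible usage (the second path is an assumption, e.g. a mounted copy):
# replicas_identical /home/sa/misc /mnt/replica/home/sa/misc && echo "same"
```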

Determining our Roots

We now need to figure out which computers and hard drives we want to use to house the replicas of our files (these locations are called roots), and how they are going to communicate with one another (either locally or remotely).

I recommend a star topology where one server (if possible) with a constant Internet connection is the central root, and all other computers synchronize with it remotely via SSH. This effectively turns the Unison peer-to-peer system into a client-server system. If someone does not have access to a server, then he might use a removable hard drive as his central root and move it between computers whenever he wants to synchronize the files/replicas.

Configuring

This is the point where everyone should already have taken his time with the theoretical part of this page and the official Unison manual. I am not going to provide the reader with yet another version of the official manual. I am just showing my Unison config file(s) and explaining a bit what I did and why.

Setting up our Unison profile

On the computer where we invoke Unison, it looks for a profile located in the ~/.unison directory to know which two locations (called roots) to synchronize and which options to use. Here is how my setup works. With every profile there are

settings specific to the current node/replica where Unison is invoked (I put these settings into a file called ~/.unison/default.prf) and
settings common to all nodes/replicas involved (I put these settings into a file called ~/.unison/common)

What does this mean? Well, since every profile file can be divided into local and common parts, we can split a profile file into two files — one containing the common parts for all nodes/replicas and one containing information specific to one particular node/replica. As a consequence, we can share the file containing the common parts among all involved nodes/replicas. This is what I do, sharing the common parts and keeping one node specific file per node/replica.

Below follows the node-specific part of my subnotebook's profile. The file containing the node-specific part is called default.prf, as can be seen in line 1. Lines 5 and 6 show the node-specific information in detail (the roots, i.e. the two replicas which Unison synchronizes based on the profile file it reads with every run).

 1  sa@sub:~$ cat .unison/default.prf
 2  ## Unison preferences file
 3
 4  ## Roots of the synchronization
 5  root = /
 6  root =  ssh://192.168.1.4:1235//home/sa/ur/0/
 7
 8  ## Include common settings for profiles no matter where they are
 9  ## invoked (client or server)
10  include common

Line 5 shows the file system root (i.e. /) on my subnotebook and line 6 is the URI (Uniform Resource Identifier) that points to a directory on my server at home (the one that synchronizes itself with my other server located in the datacenter; see above), located within my LAN at home. /home/sa/ur/0/ is the directory on my server where I keep the replica. As I mentioned earlier, I use approach #4 from above.

There is one thing that should be known. Unison can be invoked on the CLI by typing unison <profile_name> e.g. unison wks2sub. This requires us to have a file called wks2sub.prf in ~/.unison. If we just type unison without providing a profile's file name, then Unison will use default.prf.
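A minimal sketch of creating such a named profile from the shell. write_node_profile is a made-up helper; the only Unison-specific assumption is that the UNISON environment variable, if set, names the directory used instead of ~/.unison.

```shell
# Hypothetical helper: write a node-specific profile <name>.prf into the
# Unison directory (~/.unison unless the UNISON variable says otherwise).
write_node_profile() {
    name=$1
    local_root=$2
    remote_root=$3
    dir=${UNISON:-$HOME/.unison}
    mkdir -p "$dir"
    cat > "$dir/$name.prf" <<EOF
## Unison preferences file

## Roots of the synchronization
root = $local_root
root = $remote_root

## Include common settings shared by all nodes
include common
EOF
}

# Example: create wks2sub.prf, afterwards usable as "unison wks2sub".
# write_node_profile wks2sub / ssh://192.168.1.4:1235//home/sa/ur/0/
```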

Line 10 shows how we include the common part of the profile file. We keep the common parts for all nodes in a separate file, called common in my case. Then we can use include to pull in the common parts of the profile file with every run of Unison. Here is what my common part looks like.

11  sa@sub:~$ cat .unison/common
12  ## Paths (directories resp. files) to synchronize
13  #directories
14  path = home/sa/Desktop
15  path = home/sa/Mail
16  path = home/sa/News
17  path = home/sa/misc
18  path = home/sa/work/git
19  path = home/sa/em
20  path = home/sa/mm
21  path = usr/local/sbin
22  path = home/sa/.adobe
23  path = home/sa/.purple
24  path = home/sa/.mozilla
25  path = home/sa/.workrave
26  path = home/sa/.sec
27  path = home/sa/.local
28  path = home/sa/.emacs.d
29
30
31  #files
32  path = home/sa/.bashrc
33  path = home/sa/.bash_history
34  path = home/sa/.bash_profile
35  path = home/sa/.dingrc
36  path = home/sa/.emacs
37  path = home/sa/.emacs.elc
38  path = home/sa/.emacs.desktop
39  path = home/sa/.emacs.desktop.lock
40  path = home/sa/.dired
41  path = home/sa/.unison/common
42  path = usr/share/games/fortunes/mjg
43  path = usr/share/games/fortunes/mjg.dat
44  path = usr/share/games/fortunes/mjg.u8
45

Because I opted for approach #4, it is actually pretty simple yet very powerful, something I like a lot! Lines starting with # are comments. In lines 14 to 44 I am specifying what shall be synchronized between the replicas, i.e. my subnotebook and the server in this particular case (remember folks, we are looking at the common part of my profile file on my subnotebook, which happens to be an active node in a star topology scenario). In more detail, lines 14 to 28 specify directories (recursively). Lines 32 to 44 are single files.

46
47
48  ## Data not to be synchronized
49  ignore = Path home/sa/mm/audio/music
50  ignore = Path home/sa/.adobe/Acrobat/8.0/Synchronizer
51
52  ## Miscellaneous settings
53  rshargs = -C
54  auto = true
55  confirmbigdeletes = true
56  perms = -1
57  owner = true
58  group = true
59  times = true
60  #force = newer
61  sortbysize = true
62  sortnewfirst = true
63  maxthreads = 50
64  log = true
65  logfile = /home/sa/.unison/unison.log
66  sa@sub:~$

In lines 49 and 50 I specified two paths I want to ignore i.e. although they are located within a path that is listed for synchronization (line 20 respectively line 22), I am excluding them from synchronization. Lines 53 to 65 contain various settings regarding the overall synchronization process. I already listed a description about their meaning above in this section.
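An ignore entry only makes sense if it sits below one of the synchronized paths. The sketch below checks that for a profile file; uncovered_ignores is a made-up helper, and it assumes the plain "path = ..." and "ignore = Path ..." line format shown above, without spaces inside path names.

```shell
# Hypothetical helper: print every "ignore = Path ..." entry of a profile
# file that is not located below one of its "path = ..." entries.
uncovered_ignores() {
    awk '
        # Collect synchronized paths, e.g. "path = home/sa/mm"
        $1 == "path" && $2 == "=" { paths[++n] = $3 }
        # Collect ignored paths, e.g. "ignore = Path home/sa/mm/audio/music"
        $1 == "ignore" && $2 == "=" && $3 == "Path" { ignores[++m] = $4 }
        END {
            for (i = 1; i <= m; i++) {
                covered = 0
                for (j = 1; j <= n; j++) {
                    prefix = paths[j] "/"
                    if (index(ignores[i], prefix) == 1) covered = 1
                }
                if (!covered) print ignores[i]
            }
        }
    ' "$1"
}

# Possible usage: no output means every ignore entry is covered.
# uncovered_ignores ~/.unison/common
```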

Using Unison

Ok, now that we have made our initial copies, renames, deletes, etc. and set-up a basic profile which tells Unison which two locations (roots) to synchronize, we are ready to run Unison for the first time.

We can invoke Unison by typing unison, and it will use the options in default.prf (and common because of the include common statement). During this first run, Unison will take quite a long time because it traverses through all files and builds up auxiliary metadata about each one of them (stored in a file in the ~/.unison/ directory). After it is done, it will ask questions when there are conflicts between files. One can press ? to see the choices that we have when Unison asks us questions.

However, no files should be different during this initial run because we have just made a fresh identical copy across the two roots.

After Unison finishes propagating all changes (not that there should be any on the initial run since we did our preparatory work), those two roots have been initialized. When we run Unison again on those two roots, it should go much faster because the metadata has already been stored. We need to repeat this process with every pair of roots that we want to synchronize. If possible, I suggest adopting a star topology and synchronizing all roots against a central server/replica/root, which minimizes the number of pair-wise synchronizations required.
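To put a number on that: with n roots (including the central one), a star needs n - 1 pair-wise synchronizations, whereas synchronizing every root directly against every other root needs n(n - 1)/2. A small sketch, with function names made up for illustration:

```shell
# Number of pair-wise synchronizations for n roots.
# Star: every node synchronizes against the central root only.
star_syncs() { echo $(( $1 - 1 )); }
# Full mesh: every root synchronizes against every other root.
mesh_syncs() { echo $(( $1 * ($1 - 1) / 2 )); }

star_syncs 4   # prints 3
mesh_syncs 4   # prints 6
```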

About speed: I often hear folks complaining that Unison is slow. Sure it is if one is not using approach #4, i.e. changes substantial parts of his data every now and then, because Unison then has to rebuild its metadata over and over again. Not setting rshargs = -C (which makes SSH compress the connection) can be another cause, as can a slow connection if the two replicas are not located on the same computer. All I can say is that I keep around 180 GiB in sync and, except for the initial buildup of the metadata, running Unison takes no more than ~60 seconds under normal circumstances, i.e. when there is no need to copy tons of new mp3s over WiFi and the like. Unison makes use of the rsync algorithm and is therefore damn fast if used correctly.

We have to remember to synchronize every time right after we log in to a machine and right before we log out. Unison is only effective if we use it! Maybe we want to trigger Unison some other way instead of invoking it manually every time?! I do so. What exactly I do, and other more advanced aspects of using Unison, follow below.

Advanced Topics

What we did so far is already pretty sophisticated and covers a lot of use cases. However, there are a few things that I consider useful...

Automating the Process

One thing we can do in order to make things even more comfortable and suitable for the forgetful mind is to automate the whole process, i.e. we take our Unison configuration and trigger the synchronization process depending on certain circumstances. Those circumstances can be

Event triggered, i.e. every time something happens Unison starts synchronizing, or
Time triggered, i.e. Unison synchronizes at a specific time or periodically, or
A combination of the two. Assume there is a guy called Markus who uses the star topology approach with Unison in order to synchronize his phone, subnotebook and workstations with his server located inside a datacenter. Markus has several requirements for this process in order to make it as secure, comfortable and smooth as possible:
- Any time he is using one of his computers or gadgets (phone, subnotebook or workstation) and it gets connected to the Internet, it should try to synchronize with the server (there may be firewalls blocking this; think of Internet cafes etc.), OR (logical OR, that is)
- if the computer/gadget is already connected to the Internet and the last synchronization with the server happened more than 120 minutes ago (or whatever number one picks here), then it should synchronize with the server. The interval could be set to a different prime number on each computer (e.g. 113 minutes for the subnotebook and 127 minutes for the workstation) so that both do not try to synchronize with the server at the same time. This should only happen if, since the last run, any data has changed in one of the two replicas (i.e. either on the server or on the computer/gadget itself), AND
- if both are already connected to the Internet and some very important files/directories (which would get synchronized every 120 minutes anyway) change, Unison should start syncing these changes immediately.
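The trigger conditions above can be condensed into one decision function. This is only a sketch of the logic as I read it, not something Unison provides; should_sync and its arguments are made up, and how each input is determined (connectivity check, change detection) is left open.

```shell
# Hypothetical decision helper. Arguments:
#   $1 connected       1 if the server is reachable, else 0
#   $2 just_connected  1 if the connection was just established
#   $3 minutes_since   minutes since the last synchronization
#   $4 interval        interval in minutes, e.g. 113 or 127
#   $5 changed         1 if any data changed since the last run
#   $6 important       1 if a very important file or directory changed
# Returns 0 (true) if a synchronization should start now.
should_sync() {
    [ "$1" -eq 1 ] || return 1                       # offline: nothing to do
    [ "$2" -eq 1 ] && return 0                       # fresh connection
    [ "$6" -eq 1 ] && return 0                       # important change
    [ "$3" -gt "$4" ] && [ "$5" -eq 1 ] && return 0  # interval over and data changed
    return 1
}

# Example: connected for a while, 120 minutes since the last run,
# interval 113 minutes, data changed.
should_sync 1 0 120 113 1 0 && echo "sync now"
```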

Either way, what needs to be done is to figure out how to

Avoid Unison asking any questions that would require human interaction. This can be done using Unison's batch and auto options.
Provide Unison with our authentication credentials for SSH (in case we synchronize over a TCP/IP network using SSH, which is what we do). This can be done using PKA (Public Key Authentication) with the SSH agent set up.
Make sure we have a consistent view of the data while syncing (read: snapshot), i.e. we need data persistence and data integrity while syncing. This can be achieved in many ways, for example using LVM snapshots or enterprise-class RAID HBAs (Host Bus Adapters). I am planning to use BTRFS (B-Tree File System), which is why I am now (November 2009) still triggering the sync manually: I need to shut down a few applications, for example iceweasel and pidgin, so that data does not change while I sync with the server.
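For the first point, an unattended run boils down to passing the batch and auto options on the command line. A minimal sketch, assuming SSH public key authentication is already working; run_unattended_sync and the UNISON_BIN variable are made up for this example (UNISON_BIN is not a Unison setting, it only allows a dry run).

```shell
# Hypothetical wrapper for unattended runs.
run_unattended_sync() {
    profile=${1:-default}
    # -auto accepts the default action for non-conflicting changes,
    # -batch asks no questions at all and skips conflicts.
    "${UNISON_BIN:-unison}" "$profile" -batch -auto
}

# Dry run: print the arguments instead of synchronizing.
# UNISON_BIN=echo run_unattended_sync default
```

Conflicting changes are skipped in batch mode, so an occasional interactive run is still needed to resolve them.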

Event Triggered

In case we wanted event-triggered synchronization, candidates for triggering it are

inotify respectively incron
dnotify
udev
some home-made stuff

Out of the box inotify does not watch subdirectories. However, there are tons of wrappers out there

wks:/home/sa# type afs; afs pynotify | grep so
afs is aliased to `apt-file search'
python-notify: /usr/lib/python-support/python-notify/python2.4/gtk-2.0/pynotify/_pynotify.so
python-notify: /usr/lib/python-support/python-notify/python2.5/gtk-2.0/pynotify/_pynotify.so
wks:/home/sa#
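One way to watch a whole directory tree is inotifywait from the inotify-tools package, which can add watches recursively. The loop below is a rough sketch, not my production setup: watch_and_sync is a made-up name, and changes that happen while Unison is running are only picked up with the next event.

```shell
# Sketch: block until something changes below a directory, then sync.
# inotifywait -r watches subdirectories recursively, -q keeps it quiet,
# -e selects the events to wait for. It exits after the first event.
watch_and_sync() {
    dir=$1
    profile=${2:-default}
    while inotifywait -r -q -e close_write,create,delete,move "$dir"; do
        unison "$profile" -batch -auto
    done
}

# Possible usage (runs until inotifywait fails or is interrupted):
# watch_and_sync /home/sa/misc default
```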

WRITEME

Time Triggered

cron
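Intervals such as 113 or 127 minutes cannot be written directly into a crontab minute field. One possible approach, sketched here under that assumption, is to let cron start a small script every few minutes and have the script decide whether the interval has elapsed. interval_elapsed, the stamp file and the script name in the crontab line are all made up.

```shell
# Hypothetical helper: succeed if the stamp file is missing or the time
# recorded in it is at least the given number of minutes in the past.
interval_elapsed() {
    stamp=$1                 # file holding the epoch seconds of the last run
    minutes=$2               # interval, e.g. 113 on one node, 127 on another
    now=${3:-$(date +%s)}    # current time, overridable for testing
    [ -f "$stamp" ] || return 0
    last=$(cat "$stamp")
    [ $(( now - last )) -ge $(( minutes * 60 )) ]
}

# Inside the script started by cron one could then write:
# if interval_elapsed "$HOME/.unison/last-sync" 113; then
#     unison default -batch -auto && date +%s > "$HOME/.unison/last-sync"
# fi
#
# Possible crontab line (checks every five minutes):
# */5 * * * * /home/sa/bin/unison-interval.sh
```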

WRITEME