I've previously mentioned the gate page on Linux, but I thought a little more information might be useful.
The reason why system calls are so slow on x86 goes back to the ingenious but complicated segmentation scheme used by the processor. The original reason for segmentation was to be able to use more than the 16 bits available in a register for an address, as illustrated below.
When x86 moved to 32 bit registers, the segmentation scheme remained but in a different format.
Rather than fixed segment sizes, segments are allowed to be any size. This means the processor needs to keep track of all these different segments and their sizees, which it does descriptors. The segment descriptors available to everyone are kept in the global descriptor table or GDT for short. Each process has a number of registers which point to entries in the GDT; these are the segments the process can access (there are also local descriptor tables, and it all interacts with task state segments, but that's not important now).
The situation is illustrated below.
Since the processor knows what segments of memory the currently running process can access, it can enforce protection and ensure the process doesn't touch anything it is not supposed to. If it does go out of bounds, you receive a segmentation fault, which most programmers are familiar with.
The interesting bit comes when you want to make calls into code that resides in another segment. To implement a secure system, we can give segments a certain permission value. x86 does this with rings, where ring 0 is the highest permission, ring 3 is the lowest, and inner rings can access outer rings but not vice-versa.
Like any good nightclub, once you're inside "club ring 0" you can do anything you want. Consequently there's a bouncer on the door, in the form of a call gate. When ring 3 code wants to jump into ring 0 code, you have to go through the call gate. If you're on the door list, the processor gets bounced to a certain offset of code within the ring 0 segment.
This allows a whole hierarchy of segments and permissions between them. You might have noticed a cross segment call sounds exactly like a system call. If you've ever looked at Linux x86 assembly the standard way to make a system call is int 0x80, which raises interrupt 0x80. An interrupt stops the processor and goes to an interrupt gate, which then works the same as a call gate -- it bounces you off to some other area of code assuming you are allowed.
The problem with this scheme is that it is slow. It takes a lot of effort to do all this checking, and many registers need to be saved to get into the new code. And on the way back out, it all needs to be restored again.
On a modern system, the only thing that really happens with all this fancy segmentation switching is system calls, which essentially switch from mode 3 (userspace) to mode 0 and jump to the system call handler code inside the kernel. Thus sysenter (and sysexit to get back) speed up the whole process over an int 0x80 call by removing the general nature of a far call (i.e. the possibility of going to any segment, with any ring level) and restricting you to going to ring 0 code at a specific segment and offset, as stored in registers.
Because there are so many less options, the whole process can be speed up, and hence we have a fast system call. The other thing to note is that state is not preserved when the kernel gets control. The kernel has to be careful to not to destory state, but it also means it is free to only save as little state as is required to do the job, so can be much more efficient about it. If you're thinking "wow that sounds like insert favourite RISC processor", you're probably right.
The x86 truly is a weird and wonderful machine!