Firewall

Status: Netfilter related theory done. Rest partly done or still missing.

Last changed: Thursday 2014-07-10 16:45 UTC

Abstract:

The term firewall is misleading. There is no fire. There is not even a wall. All there is are, for example, electrical signals (pulses) on a wire which are encoded/decoded as so-called bits. Many of them put together make up a so-called packet, and the bits it contains are the so-called payload. If we chain several packets together, a queue of packets is what we get. Now, if we put a sender and receiver address onto a packet, an IP packet is what we get. At this point we have three types of information: the one within the packet, the one on the packet, and the information about the queue itself i.e. which packets in what particular order make up the queue. Still, as a matter of fact, in the end, we are talking electrical pulses on the wire, no fire, no wall, no fiction, just simple science. All three types of information can be used to do something with those IP packets e.g. deny/grant access/traversal, mangle, masquerade, etc. So, in the end, talking about the term "firewall" actually means talking IP packet inspection and manipulation e.g. changing sender/receiver address, payload, sequence order etc. Again, no fire, no wall, no myths and legends... no dumb sales or marketing chatter either... just deterministic, simple to explain technology. This page will take a look at inspecting/manipulating IP packet streams from a practical angle but will, where necessary and/or appropriate, take a look under the hood of the technology that is being used. We will also take a look at accompanying technologies to IP packet inspection/manipulation, like for example port knocking and so-called pro-active approaches which change settings on-the-fly, thereby adapting to current situations and threats.

Table of Contents

Introduction

Types of Firewalls

Firewalling with Linux

Alternatives to Netfilter
Using Netfilter in Conjunction with other Tools/Software

Packet Filtering with Linux

Components

Netfilter
Connection Tracking
xtables

Terms

Installation and Setup

Planning an IP Filter
Prerequisites
Files
Kernel Modules

State Machine

Tables / Chains / Rules

Packet Flow and Relationship between Tables and Chains
The Routing Tables in Detail
User-specified chains
Rules
Matches
Targets and Jumps

Network Address Translation

NAT Use Cases and Terms
Caveats using NAT
Example NAT machine in theory
SNAT
DNAT

Logging

ULOG

Particular Use Cases

A default setup
OpenVZ
MARK Target

Apply Rules

Debugging / Testing

Start Packet Filter at System Boot

/etc/init.d/script_name or /etc/network/interfaces
Configuration

Port Knocking

fwknop

Prerequisites
Install and Configure
Miscellaneous

Authentication

DoS, DDoS

OSI Layer 4
OSI Layer 7

Pro-active Approaches

fail2ban
psad
fwsnort

Miscellaneous

xtables addons

Application Layer

GUIs

fwbuilder

Saving / Restoring Rulesets

Security in a system is made up of layers, firewalling should be the
last to include, once all services have been hardened.
— Debian Security Manual

I decided to give this page the tech level T->2 because understanding the internet protocol suite, with its two most important protocols IP (Internet Protocol) and TCP (Transmission Control Protocol), is a mandatory prerequisite for understanding this page.

In the course of learning about the internet protocol suite, one will also learn about the OSI (Open Systems Interconnection) standard which is important to know about as well.

Since this page provides a lot of information, those who are looking for a quickstart should follow the links below right away:

My approach to firewalling, the overall idea.
Start packet filter at system boot, how it is done
One can find all of my scripts here — packet_filter and generic.sh are needed for firewalling.

Introduction

A firewall (sometimes also known as packet filter; see types of firewalls) is an integrated collection of security measures designed to prevent unauthorized electronic access to a networked computer system.

It is also a device or set of devices configured to permit, deny, encrypt, decrypt, or proxy all computer traffic between different security domains based upon a set of rules and other criteria.

Firewalls can be implemented in hardware, in software, or in a combination of both. Firewalls are frequently used to prevent unauthorized Internet users from accessing private networks connected to the Internet, especially intranets.

All messages entering or leaving the intranet pass through the firewall, which examines each message and blocks those that do not meet the specified security criteria.

From a purely technical point of view, a firewall is a dedicated appliance and/or software running on a computer, which inspects network traffic passing through it, and denies or permits passage based on a set of rules.

A firewall's basic task is to regulate some of the flow of traffic between computer networks of different trust levels. Typical examples are the Internet which is a zone with no trust and an internal network which is a zone of higher trust.

A zone with an intermediate trust level, situated between the Internet and a trusted internal network, is often referred to as a perimeter network or DMZ (Demilitarized Zone).

A firewall's function within a network is similar to that of physical firewalls with fire doors in building construction. In the former case it is used to prevent network intrusion into the private network; in the latter case it is intended to contain a structural fire and delay it from spreading to adjacent structures.

Without proper configuration, a firewall can often become worthless. Standard security practices dictate a default-deny (also known as whitelisting) firewall ruleset, in which the only network connections which are allowed are the ones that have been explicitly allowed.

Such configuration requires detailed understanding of the network applications and endpoints required for the day-to-day operation.

Many people responsible for a firewall lack such understanding, and therefore implement a default-allow (blacklisting) ruleset, in which all traffic is allowed unless it has been specifically blocked. This configuration makes inadvertent network connections and system compromise much more likely.
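
The difference between the two rulesets can be sketched with iptables as follows. This is only a minimal sketch under assumptions: the helper run and the variable APPLY are hypothetical names introduced here so that the commands are merely printed unless we explicitly ask for them to be executed, and the ports (22 and 23) are arbitrary examples rather than a recommendation.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command; with APPLY=1 (as root) it is executed instead.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Default-deny (whitelisting): only explicitly allowed connections get through.
default_deny() {
    run iptables -A INPUT -i lo -j ACCEPT
    run iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    run iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j ACCEPT
    # Policies are set last so a running remote session is not cut off while loading.
    run iptables -P INPUT DROP
    run iptables -P FORWARD DROP
}

# Default-allow (blacklisting), for contrast: everything passes unless it is blocked.
default_allow() {
    run iptables -P INPUT ACCEPT
    run iptables -A INPUT -p tcp --dport 23 -j DROP
}

default_deny
```

With default_deny, anything we forgot to think about is dropped; with default_allow, anything we forgot to think about gets through, which is exactly why the latter makes inadvertent connections more likely.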

Types of Firewalls

Often people actually refer to a packet filter (e.g. netfilter/iptables) when they talk about a firewall, but a packet filter is just one of several components of a firewall!

A firewall usually consists of several building blocks, each one responsible for a particular job. In essence, there are 4 building blocks a firewall can be made of:

A packet filter filters packets on the network layer (OSI layer 3).
Modern packet filters are also capable of stateful filtering which means they also operate on the transport layer (OSI layer 4).
A so-called application layer firewall operates on the application layer (OSI layer 7).
Last but not least, there are proxies. In terms of firewalling however, the boundary between proxies and application layer firewalls is somewhat blurred i.e. the same thing might be referred to as both a proxy and an application layer firewall.

Firewalling with Linux

Since this page is about Debian GNU/Linux it will focus on Linux and its firewalling capability. As of now (April 2011) there is one predominant firewalling framework called netfilter; others, like for example HiPAC or nftables, are still under heavy development. Basically there are two components with netfilter:

the kernelspace component called netfilter and
the userspace component called iptables

Alternatives to Netfilter

Before we talk about alternatives to netfilter, let us define the overall scope of this page.

We are not talking about central connectivity hubs and backbone technologies as they are found in datacenters with continuous data rates in the tens of Gibibits if not Tebibits... that kind of stuff is done with Cisco, Juniper or force10 setups, for example with multi-redundant configurations in mesh topology. Using netfilter here would be like trying to propel an aircraft carrier with a car engine...

What we are talking about are much smaller continuous data rates (maybe a Gibibit peak from time to time) and less complex setups, as they are found at home, at the office and things like that. Using some fat Juniper setup here would be like trying to maneuver an aircraft carrier up the River Thames.

There are always two parameters which are more or less mutually exclusive: flexibility (ease of deployment, changes, etc.) and throughput. This is the reason why different layers in networking require different techniques and therefore different gear.

Netfilter can do OSI layer 3 and 4 out of the box and OSI layer 7 with an additional classifier called l7. That fact makes Linux/Netfilter the perfect choice in order to create a reliable, well-tested, incredibly flexible and high-speed firewall for small to mid-sized computer networks.

However, it is just fair to say that big corporate/governmental/etc. networks are equipped with Juniper/Cisco/force10/etc. at the tier-1 level in order to do the firewalling/routing/switching and that netfilter may only be used on the smaller, segmented, subnetworks of those bigger networks.

I am not going into detail here, but the reasoning for this is only partially technical; things like TCO (Total Cost of Ownership), SLA (Service Level Agreement) considerations and, of course, the powerful marketing machinery of those dominant companies play a major role.

My personal experience and opinion on whether to use netfilter or non-netfilter solutions is that any company with 3000 or fewer employees, whose main business is not in IT (Information Technology), can easily use netfilter and run it on some high-end server(s) with redundant components like for example redundant power supplies, RAID (Redundant Array of Independent Disks), etc.

This might cost 15000 euros at maximum, starting at a few hundred euros in a minimalistic case. The same solution bought from Juniper or Cisco is mostly 2 to 5 times as expensive, with license costs for their proprietary software making up a big chunk of it. That is not the case for netfilter because it is FLOSS (Free/Libre Open Source Software).

However, successfully using netfilter with off-the-shelf hardware requires that a company employs some Linux expert too. This is something I can only strongly recommend, if only for the simple fact that this person can help plan the company's long-term IT strategy aside from his daily labor.

netfilter is way more flexible and powerful than entry-level Juniper/Cisco/force10/etc. gear but then the main benefit is that those companies provide all-in-one bundles i.e. starting with initial consultation to hardware/software and most importantly, SLA with 24/7 coverage if needed — something that can only be topped by employing a full-time IT expert.

Ultimately, any use case is different and thus requires individual solutions...

Using Netfilter in Conjunction with other Tools/Software

Because netfilter is FLOSS, there is a bunch of software that surrounds it or otherwise plays into the realm of firewalling on Linux:

sa@wks:~$ debtags search *firewall | wc -l
64
sa@wks:~$

Below is a subset of that software I consider most useful/important for firewalling on Linux. We can classify it into three categories. First, there is software that can be used to control/manage netfilter:

iptables: main administration tool for packet filtering and NAT with netfilter.
ipset: is an administration tool for kernel IP sets.
conntrack: a userspace command line program used to view and manage the in-kernel connection tracking state table.
iptstate: top-like state for netfilter/iptables. It is only useful if netfilter CONNTRACK is enabled in the kernel.
netstat-nat: a tool that displays NAT connections.

Then there is software that works hand in hand with netfilter in order to get a particular job done:

nufw: used to authenticate user traffic i.e. allows to write filtering rules based on user identity, in addition to classical network criteria.
fwknop: port knocking in modern SPA (Single Packet Authorization) manner.
psad: detect port scans and take measures like for example dynamically altering netfilter rules, email alert, etc.
fail2ban: dynamically updates netfilter and/or denyhost rules in order to ban IPs that cause multiple authentication errors; also capable of sending alert email.
fwsnort: translates Snort rules into equivalent netfilter rules.

And last but not least, there are optional add-ons to netfilter itself, extending its functionality:

l7: a classifier that identifies packets based on application layer data. It can classify packets as Kazaa, HTTP, Jabber, Citrix, Bittorrent, FTP, Gnucleus, eDonkey2000, etc., regardless of port. It complements existing classifiers that match on IP address, port numbers and so on.
ulog, specter: enhanced logging.

Many extensions are included in the base iptables package, such as the match extension which allows rules to query the connection tracking state.

Additional extensions are distributed in the xtables-addons-source package that replaces the older patch-o-matic-ng package. With it, experimental features get tested and possibly later included into netfilter and iptables releases.

Packet Filtering with Linux

We are now going to take a closer look at what is called the Linux packet filtering stack. The image below shows what packet filtering on Linux looks like as of now (June 2009).

As we can see, there are several layers starting with the Linux kernel itself at the bottom and ending with userspace tools at the top. The layer at the top (userspace tools) is used by us to manage/control packet filtering with Linux, which is ultimately carried out by the netfilter layer and the Linux kernel network layer at the bottom.

The intermediate layers like for example Xtables, are, in essence, frameworks that make sense from a technical point of view — they bring structure, modularity and well defined boundaries and interfaces to the whole shebang.

Last but not least, the layers on top of the networking layer abstract things out so packet filtering with Linux becomes a doable task even for the non-expert on the matter of networking as described by the OSI standard and all the other RFCs and standards with regards to networking out there.

Components

There is bidirectional (vertically as well as in some cases horizontally) information exchange among the various components in the packet filtering stack.

A lower layer provides functionality to the layer above it, which in turn instructs the lower layer to carry out some action on its own behalf and ultimately on behalf of the user. This way information travels several layers down to the netfilter layer or even deeper into the stack i.e. the Linux networking layer.

Netfilter

There are two things the word netfilter might refer to. Firstly, netfilter is the name of the project that brings firewalling capabilities to Linux. These capabilities consist of several components: kernel modules, libraries and a set of userspace tools.

Secondly, netfilter is the name of the Linux kernel framework that provides a set of hooks within the Linux kernel for intercepting and manipulating network packets.

The best-known components on top of netfilter are those used with regards to firewalling (x_tables, ip_tables the kernel module and iptables the userspace tool), but then netfilter and its hooks are also used by other components in the Linux packet filtering stack which perform NAT (Network Address Translation), stateful packet tracking and packet enqueueing to userspace.

Connection Tracking

One of the important features built on top of the netfilter framework is connection tracking — made possible by using the so-called state machine.

Connection tracking allows the Linux kernel to keep track of all logical network connections or sessions, and thereby relate all of the packets which may make up that connection.

For example, NAT relies on this information to translate all related packets in the same way, and xtables/iptables can use this information to act as a stateful firewall.

Connection tracking classifies each packet as being in a number of different states:

NEW: trying to create a new connection
ESTABLISHED: part of an already-existing connection
RELATED: packet initiating a new connection that is related to, but not actually part of an existing connection
INVALID: cannot be identified or associated with any known connection
UNTRACKED: not tracked

A typical example would be that the first packet the conntrack subsystem sees will be classified NEW, the reply would be classified ESTABLISHED and an ICMP (Internet Control Message Protocol) error would be RELATED.

An ICMP error packet which did not match any known connection would be INVALID. UNTRACKED is a special state that can be assigned by the administrator to bypass connection tracking for a particular packet.
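
How the untracked state might be assigned can be sketched as follows. This is a sketch under assumptions: it uses the raw table with the CT target and its --notrack option as found in reasonably recent iptables releases (older releases provide a NOTRACK target instead), the UDP port 53 is just an example, and the helper run with its variable APPLY is a hypothetical dry-run wrapper that only prints the commands by default.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command; with APPLY=1 (as root) it is executed instead.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

bypass_conntrack() {
    # The raw table is traversed before connection tracking, so packets can be exempted here.
    run iptables -t raw -A PREROUTING -p udp --dport 53 -j CT --notrack
    # Locally generated replies pass the raw OUTPUT chain instead of PREROUTING.
    run iptables -t raw -A OUTPUT -p udp --sport 53 -j CT --notrack
    # Untracked packets never become ESTABLISHED, so they need a rule of their own.
    run iptables -A INPUT -p udp --dport 53 -m conntrack --ctstate UNTRACKED -j ACCEPT
}

bypass_conntrack
```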

Note that the connection state is completely independent of any TCP packet/sequence state. If the host answers with a SYN/ACK packet to acknowledge a new incoming TCP (Transmission Control Protocol) connection, the TCP connection itself is not yet established but the tracked connection is i.e. this packet will match the ESTABLISHED state. Also, a tracked connection of a connectionless protocol like UDP nevertheless has a connection state.

Furthermore, through the use of plugin modules, connection tracking can be given knowledge of application layer protocols and thus understand that two or more distinct connections are related.

For example, consider FTP (File Transfer Protocol). A control connection is established, but whenever data is transferred, a separate connection is established to transfer it. When the nf_conntrack_ftp module is loaded, the first packet of an FTP data connection will be classified as RELATED instead of NEW, as it is logically part of an existing connection.

Iptables can use the connection tracking information to make packet filtering rules more powerful and easier to manage. The conntrack match extension (--ctstate at man 8 iptables) allows iptables rules to examine the connection tracking classification for a packet.

For example, one rule might allow NEW packets only from inside the packet filter to outside, but allow RELATED and ESTABLISHED in either direction. This allows normal reply packets from the outside (ESTABLISHED), but does not allow new connections to come from the outside to the inside. However, if an FTP data connection needs to come from outside the packet filter to the inside, it will be allowed, because the packet will be correctly classified as RELATED to the FTP control connection, rather than a NEW connection.
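
The rule just described might look roughly like the sketch below. The interface names are placeholders passed as arguments, and the helper run with its variable APPLY is a hypothetical dry-run wrapper that only prints the commands by default. Depending on the kernel version, loading nf_conntrack_ftp alone may not be enough and the helper may have to be assigned to the control connection explicitly, so please treat the modprobe line as an assumption about the setup.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command; with APPLY=1 (as root) it is executed instead.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

stateful_forward() {
    inside="$1"    # interface facing the internal network
    outside="$2"   # interface facing the outside world
    # FTP helper, so that FTP data connections can be classified RELATED.
    run modprobe nf_conntrack_ftp
    # Anything not matched below is dropped.
    run iptables -P FORWARD DROP
    # Replies and related connections are allowed in either direction.
    run iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # New connections may only be opened from the inside to the outside.
    run iptables -A FORWARD -i "$inside" -o "$outside" -m conntrack --ctstate NEW -j ACCEPT
}

stateful_forward eth1 eth0
```

Since there is no rule accepting NEW packets arriving on the outside interface, such packets fall through to the chain policy and are dropped.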

xtables

The xtables framework in essence is mostly a collection of C functions that are only available within the Linux kernel. Currently it is just an ongoing effort to collapse ebtables, arp_tables, ip_tables and ip6_tables into one collection of functions.

What it does is, it provides us with the possibility to add features to the Linux packet filtering stack on the fly.

To do so, we write kernel modules that register against this framework. Also, depending on the feature's category, we write an iptables userspace module e.g. something involved with logging.

By writing our new extension, we can match, mangle, track and decide the fate of any given packet or complete flows of interrelated connections (connection tracking ergo stateful that is).

Below is a listing of userspace tools sitting atop their main accompanying kernel modules i.e. the userspace tool iptables sits on top of its main kernel module ip_tables which in the end uses x_tables as can be seen

sa@wks:~$ lsmod | grep ^x_tables
x_tables               25736  12 xt_state,iptable_nat,xt_tcpudp,xt_length,ipt_ttl,xt_tcpmss,xt_TCPMSS,xt_multiport,xt_limit,xt_dscp,ipt_REJECT,ip_tables
sa@wks:~$

ebtables

Ebtables is an application program used to set up and maintain the tables containing rules which inspect Ethernet frames.

It is analogous to the iptables userspace tool, but less complicated, due to the fact that the Ethernet protocol is a simpler protocol than the IP protocol is.

arptables

arptables is a userspace tool, used to set up and maintain the tables of ARP (Address Resolution Protocol) rules in the Linux kernel. These rules inspect the ARP frames as they travel through the kernel.

Like ebtables, arptables is analogous to the iptables userspace tool, but less complicated than iptables due to the nature of the address resolution protocol.

iptables

Iptables is commonly used to inclusively refer to the kernel-level component xtables that does the actual table traversal and provides an API for kernel-level extensions.

However, more precisely, iptables is actually the name of the userspace tool used to configure and maintain a set of tables and the chains and rules stored within those tables. The tables are provided by the xtables infrastructure/framework, which in turn uses netfilter.

Its main purpose is to create and maintain rules for the packet filtering, both inbound and outbound, as well as to create and maintain rules for NAT.

Like the other userspace tools, iptables requires elevated privileges to operate i.e. it must be executed by the user root. Iptables is installed at /usr/sbin/iptables and documented at man 8 iptables.

ip6tables

Same as iptables but for IPv6.

Terms

We need to know a few basic terms in order to understand this page and be able to communicate about packet filtering:

Connection: This generally refers to a series of packets relating to each other. These packets refer to each other as an established kind of connection. A connection is, in other words, a series of exchanged packets. In TCP, this mainly means establishing a connection via the 3-way handshake, after which it is considered a connection until the release handshake.
DNAT: Destination Network Address Translation. DNAT refers to the technique of translating, or simply put changing, the destination IP address of an IP packet. This is used together with SNAT to allow several hosts to share a single Internet routable IP address and still provide server services. This is normally done by assigning different ports on the Internet routable IP address to different hosts, and then telling the Linux router where to send the traffic.
SNAT: Source Network Address Translation; masquerading is a closely related variant that uses the address of the outgoing interface. This refers to translating the source IP address of an IP packet. It is used to make it possible for several hosts to share a single Internet routable IP address, since there is currently a shortage of available IP addresses in IPv4 (IPv6 will solve this).
Kernelspace: This is more or less the opposite of userspace. This implies the actions that take place within the Linux kernel itself, and not outside of the kernel.
Userspace: With this term we refer to everything and anything that takes place outside the kernel. For example, invoking iptables -h takes place outside the kernel, while iptables -A FORWARD -p tcp -j ACCEPT takes place (partially) within the kernel, since a new rule is added to the ruleset.
Packet: A singular unit sent over a network, containing a header and a data/payload portion. For example, an IP packet or a TCP packet. In the RFCs (Request for Comments) a packet is not so generalized, instead IP packets are called datagrams, while TCP packets are called segments. With this page, pretty much everything is called packet for reasons of simplicity.
Segment: A TCP segment is pretty much the same as a packet, but a formalized word for a TCP packet.
QoS: Quality of Service is a way of specifying how a packet should be handled and what kind of service quality it should receive while sending it.
Stream: This term refers to a connection that sends and receives packets that are related to each other in some fashion. Basically, we use this term for any kind of connection that sends two or more packets in both directions. In TCP this may mean a connection that sends a SYN and then replies with a SYN/ACK, but it may also mean a connection that sends a SYN and then replies with an ICMP Host unreachable i.e. we use this term very loosely.
State: This term refers to which state the packet is in, either according to RFC 793 (TCP), or to userside states used in netfilter/iptables. Note that, as mentioned before, the states used internally and externally do not follow the RFC 793 specification fully. The main reason is that netfilter has to make several assumptions about the connections and packets.
IPSEC: Internet Protocol Security is a protocol suite used to encrypt IP packets and send them securely over the Internet.
VPN: Virtual Private Network is a technique used to create virtually private networks over non-private (thus insecure) networks, such as the Internet. IPSEC is one technique used to create VPN connections. OpenVPN is another.
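
To make the SNAT and DNAT terms a bit more tangible, here is a minimal sketch of a Linux router sharing one public address. All addresses and the interface name are placeholders (taken from documentation and private ranges), the helper run with its variable APPLY is a hypothetical dry-run wrapper that only prints the commands by default, and the sketch assumes that the FORWARD chain lets the translated traffic pass.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command; with APPLY=1 (as root) it is executed instead.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

nat_sketch() {
    wan_if="$1"       # interface with the Internet routable address
    public_ip="$2"    # the single Internet routable address
    web_server="$3"   # internal host that provides a server service
    # The router must forward packets between its interfaces at all.
    run sysctl -w net.ipv4.ip_forward=1
    # SNAT: outgoing packets get the public address as their source address.
    run iptables -t nat -A POSTROUTING -o "$wan_if" -j SNAT --to-source "$public_ip"
    # DNAT: packets arriving for port 80 get the internal server as their destination.
    run iptables -t nat -A PREROUTING -i "$wan_if" -p tcp --dport 80 -j DNAT --to-destination "${web_server}:80"
}

nat_sketch eth0 203.0.113.1 192.168.1.10
```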

Policy: There are two kinds of policies that we speak about most of the time when implementing a firewall. First we have the chain policies, which tell the firewall implementation the default behavior to take on a packet if there was no rule that matched it. The second type of policy is the security policy that we may have written documentation on, for example for the whole company or for this specific network segment. Security policies are very good documents to have thought through properly and to study properly before starting to actually implement the firewall.
Accept: To accept a packet and to let it through. This is the opposite of the drop or deny targets, as well as the reject target.
Drop/Deny: When a packet is dropped or denied, it is simply deleted, and no further actions are taken. No reply to tell the host it was dropped, nor is the receiving host of the packet notified in any way. The packet simply disappears.
Reject: This is basically the same as a drop or deny target or policy, except that we also send a reply to the host sending the packet that was dropped. The reply may be specified, or automatically calculated to some value. To this date (2009-04-19), there is unfortunately no iptables functionality to also send a packet notifying the receiving host of the rejected packet what happened i.e. doing the reverse of the REJECT target. This would be very good in certain circumstances, since the receiving host has no ability to stop DoS (Denial of Service) attacks from happening.
State: A specific state of a packet in comparison to a whole stream of packets. For example, if the packet is the first that the firewall sees or knows about, it is considered new (the SYN packet in a TCP connection), or if it is part of an already established connection that the firewall knows about, it is considered to be established. States are known through the connection tracking system, which keeps track of all the sessions.
Chain: A chain contains a set of rules that are applied to packets that traverse the chain. Each chain has a specific purpose (e.g. which table it is connected to, which specifies what this chain is able to do), as well as a specific application area (e.g. only forwarded packets, or only packets destined for this host).
Table: Each table has a specific purpose, and in iptables there are 4 tables: the raw, nat, mangle and filter tables. For example, the filter table is specifically designed to filter packets, while the nat table is specifically designed to NAT packets.
Rule: A rule is a set of a match or several matches together with a single target in most implementations of IP filters, including the iptables implementation. There are some implementations which let us use several targets/actions per rule.
Ruleset: A ruleset is the complete set of rules that are put into a whole IP filter implementation. In the case of iptables, this includes all of the rules set in the filter, nat, raw and mangle tables, and in all of the subsequent chains. Most of the time, they are written down in a configuration file of some sort.
Match: This word can have two different meanings when it comes to IP filtering. The first meaning would be a single match that tells a rule that this header must contain this and this information. For example, the --source match tells us that the source address must be a specific network range or host IP address. The second meaning is if a whole rule is a match. If the packet matches the whole rule, the jump or target instructions will be carried out e.g. the packet will be dropped.
Target: There is generally a target set for each rule in a ruleset. If the rule has matched fully, the target specification tells us what to do with the packet. For example, if we should drop or accept it, or NAT it, etc. There is also something called a jump specification. There might not be a target or jump for each rule, but there may be.
Jump: The jump instruction is closely related to a target. A jump instruction is written exactly the same as a target in iptables, with the exception that instead of writing a target name, we write the name of another chain. If the rule matches, the packet will hence be sent to this second chain and be processed as usual in that chain.
Connection tracking: Simply put, a firewall which implements connection tracking is able to track connections/streams. This ability often comes at the cost of considerable processor and memory usage. This is unfortunately true in iptables as well, but much work has been done to improve this. However, the good thing is that the firewall will be much more secure with connection tracking properly used by the implementer of the firewall policies, plus it may reduce complexity a lot e.g. assuming we allow outgoing connections, we can then also allow all incoming packets based on whether or not they are related to some already existing connection (the outgoing one i.e. the one we initiated).
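
Several of the terms above (chain, policy, rule, match, target, jump, reject) can be seen together in the following sketch. The chain name ssh_in, the port and the network range are made up for illustration, and the helper run with its variable APPLY is a hypothetical dry-run wrapper that only prints the commands by default.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command; with APPLY=1 (as root) it is executed instead.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

terms_demo() {
    # Chain: create a user-specified chain in the (default) filter table.
    run iptables -t filter -N ssh_in
    # Rule with two matches (-p tcp, --dport 22) and a jump to the chain ssh_in.
    run iptables -A INPUT -p tcp --dport 22 -j ssh_in
    # Rule with a --source match and the ACCEPT target.
    run iptables -A ssh_in --source 192.0.2.0/24 -j ACCEPT
    # REJECT target: the packet is discarded and the sender receives a reply.
    run iptables -A ssh_in -p tcp -j REJECT --reject-with tcp-reset
    # Chain policy: what happens to packets that no rule in INPUT has matched.
    run iptables -P INPUT DROP
}

terms_demo
```

A packet to port 22 is thus sent from INPUT to ssh_in, where it is either accepted (matching source) or rejected; every other incoming packet matches no rule and is handled by the DROP policy of the INPUT chain.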

Installation and Setup

Before we can start setting up our packet filter, we need to check on a few prerequisites and familiarize ourselves with a few things like for example what kernel modules we need in order to do a particular job or what files are involved in the process and where they live on the filesystem.

Planning an IP Filter

One of the first things to think about when planning firewalls is their placement. This should be a fairly simple step since mostly our networks should be fairly well segmented anyway.

One of the first places that comes to mind is the gateway between our local network(s) and the Internet. This is a place where there should be fairly tight security. Also, in larger networks it may be a good idea to separate different divisions from each other via firewalls.

For example, why should the development team have access to the human resources network, or why not protect the finance department from other networks? Simply put, we do not want an angry employee with a pink slip tampering with the salary databases.

The above means that we should plan our networks as well as possible, and plan them to be segregated, especially if the network is medium-sized or larger, say 50 workstations or more, depending on different aspects of the network.

There are basically two choices here which can be mixed or used standalone:

In between these smaller networks, we try to put firewalls (OSI layer 3 and higher) that will only allow the kind of traffic that we would like and/or
We could use VLANs (OSI layer 2).
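
The first choice, a packet filter sitting between two internal networks, might be sketched like this on the routing machine. The subnets and the single allowed service (TCP port 443) are purely hypothetical, and the helper run with its variable APPLY is a dry-run wrapper that only prints the commands by default.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command; with APPLY=1 (as root) it is executed instead.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

segment_networks() {
    dev_net="$1"   # development subnet
    hr_net="$2"    # human resources subnet
    # Replies belonging to connections that were allowed may pass.
    run iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # The only traffic we would like: development may open HTTPS connections into HR.
    run iptables -A FORWARD -s "$dev_net" -d "$hr_net" -p tcp --dport 443 -m conntrack --ctstate NEW -j ACCEPT
    # Everything else from development to HR is dropped.
    run iptables -A FORWARD -s "$dev_net" -d "$hr_net" -j DROP
}

segment_networks 10.0.1.0/24 10.0.2.0/24
```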

It may also be a good idea to create a DMZ (Demilitarized Zone) in case we have servers that are reached from the Internet as well as from the LAN. In essence, a DMZ is a small subnetwork with servers, which is closed down to the extreme.

This lessens the risk of anyone actually getting into the machines in the DMZ and, even more important, it lessens the risk of anyone getting from those machines into our LAN, either by pro-actively using the machines in the DMZ as an intermediary layer or by placing backdoors and trojans on the DMZ machines.

The machines within the DMZ thus are mostly hardened and stuffed with all kinds of IDS (Intrusion Detection System) magic and other nifty stuff like that.

There are a couple of ways to set up the policies and default behaviors in a packet filter, and this section will discuss the actual theory that we should think about before actually starting to implement a packet filter.

Before we start, we should understand that most packet filters (firewalls) have a default behavior. For example, if no rule in a specific chain matches a packet, the packet is either dropped or accepted by default. Unfortunately, there is only one policy per chain, but this is often easy to get around if we want different policies per network interface etc.
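One common way around the single policy per chain is to dispatch packets into one user-defined chain per interface and let the last rule of each chain act as its policy. The following is a minimal sketch under assumptions: the interface names eth0 and eth1, the chain names from_lan and from_inet, the IPT variable and the function name are all hypothetical. By default the commands are only printed; setting IPT=iptables as root would apply them.

```shell
#!/bin/sh
# Dry-run by default: print the iptables commands instead of executing them.
IPT="${IPT:-echo iptables}"

per_interface_policy() {
    # One user-defined chain per interface (chain names are hypothetical).
    $IPT -N from_lan
    $IPT -N from_inet
    # Dispatch packets by the interface they arrived on.
    $IPT -A INPUT -i eth1 -j from_lan
    $IPT -A INPUT -i eth0 -j from_inet
    # User-defined chains have no policy, so a final catch-all rule plays that role.
    $IPT -A from_lan -j ACCEPT
    $IPT -A from_inet -j DROP
}

per_interface_policy
```

Specific accept or drop rules for each interface would be inserted before the catch-all rule of the respective chain.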

There are two basic policies that we normally use. Either we drop everything except that which we specify (whitelisting), or we accept everything except that which we specifically drop (blacklisting).

Most of the time, we are interested in the drop policy, and then specifically accept everything that we want to allow. This means that the firewall is more secure by default, but it may also mean more work to get our packet filter to operate properly.

Our first decision is simply to figure out which type of firewall we should use, whitelisting or blacklisting that is; I always go for whitelisting i.e. drop anything by default.
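A whitelisting setup can be sketched roughly as follows. This is a minimal sketch and not a complete ruleset; the IPT variable and the function name whitelist_policy are assumptions made for illustration. By default the commands are only printed, so nothing changes until IPT=iptables is set (as root). Note that applying a DROP policy over a remote session without an accept rule for that session will cut it off.

```shell
#!/bin/sh
# Dry-run by default: print the iptables commands instead of executing them.
IPT="${IPT:-echo iptables}"

whitelist_policy() {
    # Default policies: drop whatever no rule explicitly accepts.
    $IPT -P INPUT DROP
    $IPT -P FORWARD DROP
    # Locally generated traffic is trusted in this sketch.
    $IPT -P OUTPUT ACCEPT
    # Whitelist: loopback traffic and replies to connections we initiated.
    $IPT -A INPUT -i lo -j ACCEPT
    $IPT -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
}

whitelist_policy
```

Every service that should be reachable then needs its own explicit accept rule, which is the extra work mentioned above.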

Next, how big are the security concerns — what are we going to protect with our packet filter? What kind of applications must be able to communicate through the firewall?...

Overall, it is considered best practice to apply layered security measures i.e. we should use as many independent security measures as possible/affordable at the same time, and not rely on a single security concept.

For example, we could use a fully fledged, highly secured and redundant, Linux packet filter for our main gateway between the outside world (Internet) and our LANs (maybe including a DMZ) but also harden each workstation. This way we would already introduce two independent security layers and thus boost overall IT security.

In addition to hardening each workstation (such things can be done very effectively using clusterssh, puppet or for example FAI (Fully Automatic Installation) and the like) we could also apply some minimalistic packet filter onto each workstation.

If that is not enough yet, we might go even further and set up some IDS (Intrusion Detection System) like for example OSSEC and, last but not least, set up a trusted SSH infrastructure using Monkeysphere.

However, what is utterly important, and therefore what should always happen no matter what efforts we make on the technical side, is to educate our users.

Finally, if we are diligent and consistent, we end up with what is called a security protocol that describes every possible angle of our security concept. Preferably this is some sort of document, set up for collaborative work using some CMS (Content Management System) or SCM (Software Configuration Management).

Every person who is serious about IT security maintains such a security protocol. For medium and big corporate structures it is mandatory anyway, and governments and their military have done so for decades.

One last thing to note is that it is always a good thing to follow standards, that is, to use software which conforms to standards.

As probably many of us have already seen with crappy things like for example Skype or ICQ, if we do not use standardized systems, things can go terribly wrong — Skype and ICQ use their own, proprietary, communication protocols; no one exactly knows how they work.

Instead of Skype folks should use QuteCom or Ekiga and instead of ICQ, folks should go with XMPP (a standardized protocol) simply by using Pidgin (or any other client that does XMPP). Please go here for more information about configuring and using Pidgin.

Prerequisites

There is not really much to do here. All we need is a fairly up to date Linux kernel and at least the iptables userspace tool — xtables-addons-source, arptables, ebtables etc. are all optional. dpl is an alias in my ~/.bashrc by the way.

sa@wks:~$ type dpl
dpl is aliased to `dpkg -l'
sa@wks:~$ dpl *tables* | grep ^ii | egrep -v lib\|dev
ii  arptables       0.0.3.3-1       ARP table administration
ii  ebtables        2.0.8.2-4       Ethernet bridge frame table administration
ii  iptables        1.4.3.2-1       administration tools for packet filtering and
ii  xtables-addons- 1.14-1          Source for the xtables-addons driver
sa@wks:~$ uname -r
2.6.30-1-openvz-amd64
sa@wks:~$

sysctl Settings

WRITEME

Files

There are certain files that we should at least know about:

/etc/protocols is a list of Internet protocols officially acknowledged by IANA (Internet Assigned Numbers Authority).
/etc/services, /usr/share/nmap/nmap-services or the links to Wikipedia as well as to http://www.graffiti.com/services provide information on port numbers and their usage.
/etc/init.d/<name_of_shell_script_containing_ruleset> is our main firewalling shell script which we use to store our ruleset. We use update-rc.d in order to add/remove it to/from the various runlevels. More information on the matter can be found here.
/etc/iproute2/rt_realms is used in Linux to classify routes into logical groups of routes.

Kernel Modules

As mentioned above, there are certain modules which are mandatory to be loaded for a packet filter to function (e.g. x_tables) and then there are those kernel modules which are optional based on what we are trying to accomplish with our packet filter.

A nice example is if we take a look at module dependencies — as we can see, if we wanted to use xt_tcpudp, x_tables gets loaded automatically because it is listed as a dependency to xt_tcpudp.

sa@wks:/lib/modules/2.6.26-2-openvz-amd64$ grep xt_tcpudp modules.dep
kernel/net/netfilter/xt_tcpudp.ko: kernel/net/netfilter/x_tables.ko
sa@wks:/lib/modules/2.6.26-2-openvz-amd64$
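The dependency lookup shown above can be wrapped in a small helper. This is a sketch assuming the modules.dep format shown in the transcript (module path, colon, space-separated dependency paths) and uncompressed .ko files; the function name module_deps is made up for illustration.

```shell
#!/bin/sh
# Print the direct dependencies listed for one module in modules.dep format.
# $1 = module name without the .ko suffix; modules.dep content is read from stdin.
module_deps() {
    awk -v mod="$1" -F': *' '
        # Select the line whose left-hand side ends in /<mod>.ko
        $1 ~ ("/" mod "\\.ko$") {
            n = split($2, deps, " ")
            for (i = 1; i <= n; i++) {
                sub(/.*\//, "", deps[i])   # strip the directory part
                print deps[i]
            }
        }'
}
```

A typical call would be `module_deps xt_tcpudp < /lib/modules/$(uname -r)/modules.dep`, which for the entry above prints x_tables.ko.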

We can get a list of available kernel modules using a one-liner as shown below.

wks:/etc/init.d# modprobe -l xt_* | xargs -I {} basename {} | head
xt_realm.ko
xt_connlimit.ko
xt_RATEEST.ko
xt_pkttype.ko
xt_sctp.ko
xt_limit.ko
xt_tcpudp.ko
xt_TCPOPTSTRIP.ko
xt_NFLOG.ko
xt_conntrack.ko
wks:/etc/init.d#

However, right now (April 2009) the module name prefixes in use are nf_, xt_, ipt_, ip6t_, arp_, arpt_ and ebt_, but that might all change in the future based on the naming scheme the netfilter developers settle on.

Right now, no matter what the prefix is, they all use the xtables framework already anyway.

Another thing to notice is that kernel modules written in uppercase represent targets and those written in lowercase represent matches — xt_RATEEST for example is a target whereas xt_realm is a match.
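The naming convention can be expressed as a tiny classifier. This is only a sketch of the rule of thumb described above (the function name xt_kind is hypothetical) and it looks at nothing but the module name.

```shell
#!/bin/sh
# Classify an x_tables extension by the case of its name:
# upper case after the xt_ prefix means target, lower case means match.
xt_kind() {
    name="${1%.ko}"          # drop a trailing .ko if present
    case "$name" in
        xt_[[:upper:]]*) echo target ;;
        xt_[[:lower:]]*) echo match ;;
        *)               echo unknown ;;
    esac
}

xt_kind xt_RATEEST.ko   # prints target
xt_kind xt_realm.ko     # prints match
```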

State Machine

The state machine is a special part within netfilter that should really not be called the state machine at all, since it is really a connection tracking machine.

However, most people recognize it under the name state machine — for us it is only important to know that, in order to do connection tracking, we need to have a state machine.

Connection tracking is done to let the netfilter framework know the state of a specific connection. Firewalls that implement this are generally called stateful firewalls (see types of firewalls). A stateful firewall is generally much more secure than a non-stateful firewall since it allows us to write much tighter rulesets.

Within netfilter, packets can be related to tracked connections in four different so-called states. These are known as

NEW
ESTABLISHED
RELATED and
INVALID

We will discuss each of these in more depth later. With the --state match we can easily control who or what is allowed to initiate new sessions.

All of the connection tracking is done by a special framework within the kernel called conntrack. conntrack may be loaded either as a kernel module, or as an internal part of the kernel itself. Most of the time, we need and want more specific connection tracking than the default conntrack engine can maintain.

Because of this, there are also more specific parts of conntrack that handle the TCP, UDP or ICMP protocols among others. These modules grab specific, unique, information from the packets, so that they may keep track of each stream of data.

The information that conntrack gathers is then used to determine which state the stream is currently in. For example, UDP streams are, generally, uniquely identified by their destination IP address, source IP address, destination port and source port.

In previous kernels, we had the possibility to turn defragmentation on and off. However, since iptables and netfilter, and connection tracking in particular, were introduced, this option has been removed. The reason for this is that connection tracking cannot work properly without defragmenting packets, and hence defragmenting has been incorporated into conntrack and is carried out automatically. It cannot be turned off, except by turning off connection tracking itself i.e. defragmentation is always carried out if connection tracking is turned on.

All connection tracking is handled in the PREROUTING chain, except locally generated packets which are handled in the OUTPUT chain. What this means is that netfilter will do all recalculation of states and so on within the PREROUTING chain.

If we send the initial packet in a stream, the state gets set to NEW within the OUTPUT chain, and when we receive a return packet, the state gets changed in the PREROUTING chain to ESTABLISHED, and so on.

If the first packet is not originated by ourselves, the NEW state is of course set within the PREROUTING chain. So, all state changes and calculations are done at the PREROUTING and OUTPUT hooks, after the raw table has been traversed and before the mangle and nat tables see the packet.

The conntrack entries

Let us take a brief look at a conntrack entry and how to read it in /proc/net/ip_conntrack. This file gives a list of all the current entries in our conntrack database. If we have the ip_conntrack module loaded, we can check the current connection tracking status:

wks:/home/sa# lsmod | grep conntrack
nf_conntrack_ipv4      24352  0
nf_conntrack           82688  1 nf_conntrack_ipv4
wks:/home/sa# cat /proc/net/ip_conntrack | head -n1
tcp      6 12 SYN_SENT src=192.168.1.4 dst=234.12.87.233 sport=40735 dport=30206 packets=11 bytes=765 [UNREPLIED] src=234.12.87.233 dst=192.168.1.4 sport=30206 dport=40735 packets=8 bytes=667 [ASSURED] mark=0 secmark=0 use=1
wks:/home/sa#

This example contains all the information that the conntrack module maintains to know which state a specific connection is in.

First of all, we have a protocol, which in this case is tcp. Next comes the same value as a decimal protocol number (6 for TCP). After this, we see how long this conntrack entry has to live. This value is set to 12 seconds right now and is decremented regularly until we see more traffic.

This value is then reset to the default value for the specific state that it is in at that point in time. Next comes the actual state that this entry is in at present. In the above mentioned case we are looking at a connection that is in the SYN_SENT state. The internal state names of a connection are slightly different from the ones used externally with netfilter.

The value SYN_SENT tells us that we are looking at a connection that has only seen a TCP SYN packet in one direction. Next, we see the source IP address, destination IP address, source port and destination port. At this point we see a specific keyword that tells us that we have seen no return traffic for this connection. Lastly, we see what we expect of return packets. The information details the source IP address and destination IP address (which are both inverted, since the packet is to be directed back to us). The same thing goes for the source port and destination port of the connection. These are the values that should be of any interest to us.
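The field layout described above can be picked apart with a few lines of awk. The following is a rough sketch for TCP and UDP entries only (ICMP entries carry type, code and id instead of ports); the function name ct_summary and the output format are assumptions made for illustration.

```shell
#!/bin/sh
# Summarise conntrack entries read from stdin, one per line.
# Prints: <proto> <ttl> <state or -> <src>:<sport> -> <dst>:<dport> <replied|unreplied>
ct_summary() {
    awk '{
        proto = $1; ttl = $3
        # TCP entries carry the internal state in field 4; UDP entries do not.
        state = ($4 ~ /=/) ? "-" : $4
        src = dst = sport = dport = ""
        for (i = 4; i <= NF; i++) {
            split($i, kv, "=")
            # Keep only the first occurrence, i.e. the original direction.
            if (kv[1] == "src"   && src   == "") src   = kv[2]
            if (kv[1] == "dst"   && dst   == "") dst   = kv[2]
            if (kv[1] == "sport" && sport == "") sport = kv[2]
            if (kv[1] == "dport" && dport == "") dport = kv[2]
        }
        replied = ($0 ~ /\[UNREPLIED\]/) ? "unreplied" : "replied"
        print proto, ttl, state, src ":" sport, "->", dst ":" dport, replied
    }'
}
```

On a machine that exposes the file, `head -n 3 /proc/net/ip_conntrack | ct_summary` would print one condensed line per entry.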

The connection tracking entries may take on a series of different values, all specified in the conntrack headers available in /usr/src/linux/include/net/netfilter/*.h files. These values are dependent on which sub-protocol of IP we use.

TCP, UDP or ICMP protocols take specific default values as specified in /usr/src/linux/include/net/netfilter/ip_conntrack.h. Also, depending on how this state changes, the default value of the time until the connection is destroyed will also change.

The tcp-window-tracking feature exposes all of the above timeouts as special sysctl variables, which means that they can be changed on the fly, while the system is still running. Hence, it is unnecessary to recompile the kernel every time we want to change the timeouts.

These can be altered by writing to the corresponding entries in the /proc/sys/net/ipv4/netfilter directory. We should in particular look at the /proc/sys/net/ipv4/netfilter/ip_ct_* variables.

When a connection has seen traffic in both directions, the [UNREPLIED] flag is erased from the conntrack entry. The flag that told us that the connection had not seen traffic in both directions is replaced by the [ASSURED] flag, to be found close to the end of the entry.

The [ASSURED] flag tells us that this connection is assured and that it will not be erased if we reach the maximum possible tracked connections. Thus, connections marked as [ASSURED] will not be erased, contrary to the non-assured connections (those not marked as [ASSURED]).

How many connections the connection tracking table can hold depends upon a variable that can be set through the ip-sysctl functions in recent kernels. The default value held by this entry varies heavily depending on how much memory we have. On 128 MB of RAM we will get 8192 possible entries, and at 256 MB of RAM, we will get 16376 entries. We can read and set this value through the ip_conntrack_max variable:

sa@wks:~$ cat /proc/sys/net/ipv4/ip_conntrack_max
65536
sa@wks:~$

A different, more efficient way of doing this is to set the hashsize option of the ip_conntrack module when it is loaded. Under normal circumstances ip_conntrack_max equals 8 * hashsize.

In other words, setting the hashsize to 4096 will result in ip_conntrack_max being set to 32768 conntrack entries. An example of this would be:

wks:/home/sa# modprobe ip_conntrack hashsize=4096
wks:/home/sa# cat /proc/sys/net/ipv4/ip_conntrack_max
32768
wks:/home/sa#
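The relation between the two values is plain arithmetic and can be checked before touching the module. This is a small sketch based only on the 8 * hashsize rule stated above; both function names are hypothetical.

```shell
#!/bin/sh
# Expected ip_conntrack_max for a given hashsize (8 * hashsize).
conntrack_max_for() {
    echo $(( $1 * 8 ))
}

# Inverse: hashsize needed for a desired ip_conntrack_max.
hashsize_for() {
    echo $(( $1 / 8 ))
}

conntrack_max_for 4096   # prints 32768
hashsize_for 65536       # prints 8192
```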

User-land states

As we have seen, packets may take on several different states within the kernel itself, depending on what protocol we are talking about.

However, outside the kernel, we only have the four states described previously, plus UNTRACKED for packets that have been exempted from connection tracking. These states can mainly be used in conjunction with the state match which will then be able to match packets based on their current connection tracking state.

The valid states are NEW, ESTABLISHED, RELATED, INVALID and UNTRACKED. The following list will briefly explain each possible state:

NEW: The NEW state tells us that the packet is the first packet that we see. This means that the first packet that the conntrack module sees, within a specific connection, will be matched. For example, if we see a SYN packet and it is the first packet in a connection that we see, it will match. However, the packet may as well not be a SYN packet and still be considered NEW. This may lead to certain problems in some instances, but it may also be extremely helpful when we need to pick up lost connections from other firewalls, or when a connection has already timed out, but in reality is not closed.
ESTABLISHED: The ESTABLISHED state has seen traffic in both directions and will then continuously match those packets. ESTABLISHED connections are fairly easy to understand. The only requirement to get into an ESTABLISHED state is that one host sends a packet, and that it later on gets a reply from the other host. The NEW state will upon receipt of the reply packet to (or through) the firewall change to the ESTABLISHED state. ICMP reply messages can also be considered as ESTABLISHED, if we created a packet that in turn generated the reply ICMP message.
RELATED: The RELATED state is one of the more tricky states. A connection is considered RELATED when it is related to another already ESTABLISHED connection. What this means is that for a connection to be considered RELATED, we must first have a connection that is considered ESTABLISHED. The ESTABLISHED connection will then spawn a connection outside of the main connection. The newly spawned connection will then be considered RELATED, if the conntrack module is able to understand that it is RELATED. Some good examples of connections that can be considered RELATED are the FTP-data sessions that are considered RELATED to the FTP control session, and the DCC (Direct Client-to-Client) connections issued through IRC (Internet Relay Chat). The RELATED state can be used to allow ICMP error messages, FTP transfers and DCCs to work properly through the firewall. Do note that most TCP protocols and some UDP protocols that rely on this mechanism are quite complex and send connection information within the payload of the TCP or UDP data segments, and hence require special helper modules to be correctly understood.
INVALID: The INVALID state means that the packet cannot be identified or that it does not have any state. This may be due to several reasons, such as the system running out of memory or ICMP error messages that do not correspond to any known connections. Generally, it is a good idea to DROP everything in this state.
UNTRACKED: This is the UNTRACKED state. In brief, if a packet is marked within the raw table with the NOTRACK target, then that packet will show up as UNTRACKED in the state machine. This also means that all RELATED connections will not be seen, so some caution must be taken when dealing with the UNTRACKED connections since the state machine will not be able to see related ICMP messages etc.
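Two of the states in the list above translate directly into rules: dropping INVALID packets and handling traffic that was exempted from tracking. The following is a minimal sketch; the choice of UDP port 53 is only an example, and the IPT variable and function name are assumptions. The commands are printed rather than executed unless IPT=iptables is set.

```shell
#!/bin/sh
# Dry-run by default: print the iptables commands instead of executing them.
IPT="${IPT:-echo iptables}"

state_housekeeping() {
    # Drop packets that conntrack could not classify.
    $IPT -A INPUT -m state --state INVALID -j DROP
    # Exempt incoming DNS queries from tracking in the raw table (example port).
    $IPT -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
    # Untracked packets never become ESTABLISHED, so accept them explicitly.
    $IPT -A INPUT -p udp --dport 53 -m state --state UNTRACKED -j ACCEPT
}

state_housekeeping
```

Because untracked traffic is invisible to the state machine, ICMP errors belonging to it will not show up as RELATED and would need their own rules.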

These states can be used together with the --state match to match packets based on their connection tracking state. This is what makes the state machine so incredibly strong and efficient for our packet filter. Previously, we often had to open up all ports above 1024 to let all traffic back into our LAN again. With the state machine in place this is not necessary any longer, since we can now just open up the packet filter for return traffic and not for all kinds of other traffic.

TCP connections

In this section and the upcoming ones, we will take a closer look at the states and how they are handled for each of the three basic protocols TCP, UDP and ICMP.

Also, we will take a closer look at how connections are handled per default, if they cannot be classified as either of these three protocols. We have chosen to start out with the TCP protocol since it is a stateful protocol in itself, and has a lot of interesting details with regard to the state machine in netfilter.

A TCP connection is always initiated with the 3-way handshake, which establishes and negotiates the actual connection over which data will be sent. The whole session is begun with a SYN packet, then a SYN/ACK packet and finally an ACK packet to acknowledge the whole session establishment. At this point the connection is established and able to start sending data. The big problem is, how does connection tracking hook up into this? Quite simply really.

As far as the user is concerned, connection tracking works basically the same for all connection types. Have a look at the picture below to see exactly what state the stream enters during the different stages of the connection.

As we can see, the connection tracking code does not really follow the flow of the TCP connection from the user's viewpoint. Once it has seen one packet (the SYN), it considers the connection as NEW. Once it sees the return packet (SYN/ACK), it considers the connection as ESTABLISHED.

If we think about this a second, we will understand why. With this particular implementation, we can allow NEW and ESTABLISHED packets to leave our LAN, only allow ESTABLISHED connections back, and that will work perfectly.

Conversely, if the connection tracking machine were to consider the whole connection establishment as NEW, we would never really be able to stop outside connections to our LAN, since we would have to allow NEW packets back in again.
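The reasoning above maps onto a pair of forwarding rules. This is a sketch under assumptions: eth1 is taken to face the LAN and eth0 the Internet, and the IPT, LAN_IF and INET_IF variables as well as the function name are hypothetical. The commands are only printed unless IPT=iptables is set.

```shell
#!/bin/sh
# Dry-run by default: print the iptables commands instead of executing them.
IPT="${IPT:-echo iptables}"
LAN_IF="${LAN_IF:-eth1}"     # assumed LAN-facing interface
INET_IF="${INET_IF:-eth0}"   # assumed Internet-facing interface

stateful_forwarding() {
    # LAN -> Internet: new and already established connections may leave.
    $IPT -A FORWARD -i "$LAN_IF" -o "$INET_IF" -m state --state NEW,ESTABLISHED -j ACCEPT
    # Internet -> LAN: only packets of connections that the LAN opened.
    $IPT -A FORWARD -i "$INET_IF" -o "$LAN_IF" -m state --state ESTABLISHED -j ACCEPT
    # Everything else that tries to cross the firewall is dropped.
    $IPT -P FORWARD DROP
}

stateful_forwarding
```

Since the SYN/ACK already counts as ESTABLISHED, the second rule lets the handshake complete without ever admitting NEW packets from outside.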

To make things more complicated, there are a number of other internal states that are used for TCP connections inside the kernel, but which are not available for us from userspace. Roughly, they follow the state standards specified within RFC 793.

As we can see, it is really quite simple, seen from the user's point of view. However, looking at the whole construction from the kernel's point of view, it is a little more difficult. Let us look at an example.

Consider exactly how the connection states change in the /proc/net/ip_conntrack table. The first state is reported upon receipt of the first SYN packet in a connection.

tcp      6 117 SYN_SENT src=192.168.1.5 dst=192.168.1.35 sport=1031 dport=23 [UNREPLIED] src=192.168.1.35 dst=192.168.1.5 sport=23 dport=1031 use=1

As we can see from the above entry, we have a precise state in which a SYN packet has been sent, (the SYN_SENT flag is set), and to which as yet no reply has been sent (witness the [UNREPLIED] flag). The next internal state will be reached when we see another packet in the other direction.

tcp      6 57 SYN_RECV src=192.168.1.5 dst=192.168.1.35 sport=1031 dport=23 src=192.168.1.35 dst=192.168.1.5 sport=23 dport=1031 use=1

Now we have received a corresponding SYN/ACK in return. As soon as this packet has been received, the state changes once again, this time to SYN_RECV. SYN_RECV tells us that the original SYN was delivered correctly and that the SYN/ACK return packet also got through the firewall properly.

Moreover, this connection tracking entry has now seen traffic in both directions and is hence considered as having been replied to. This is not shown explicitly; it is implied by the absence of the [UNREPLIED] flag seen above. The final step will be reached once we have seen the final ACK in the 3-way handshake.

tcp      6 431999 ESTABLISHED src=192.168.1.5 dst=192.168.1.35 sport=1031 dport=23 src=192.168.1.35 dst=192.168.1.5 sport=23 dport=1031 [ASSURED] use=1

In the last example, we have gotten the final ACK in the 3-way handshake and the connection has entered the ESTABLISHED state, as far as the internal mechanisms of netfilter are aware. Normally, the stream will be ASSURED by now.

A connection may also enter the ESTABLISHED state, but not be [ASSURED]. This happens if we have connection pickup turned on (this requires the tcp-window-tracking, and the ip_conntrack_tcp_loose to be set to 1 or higher). The default, without the tcp-window-tracking, is to have this behavior, and is not changeable.

When a TCP connection is closed down, it is done in the following way and takes the following states.

As we can see, the connection is never really closed until the last ACK is sent. Do note that this picture only describes how it is closed down under normal circumstances. A connection may also, for example, be closed by sending a RST (reset), if the connection were to be refused. In this case, the connection would be closed down immediately.

When the TCP connection has been closed down, the connection enters the TIME_WAIT state, which is per default set to 2 minutes. This is used so that all packets that have gotten out of order can still get through our rule-set, even after the connection has already closed.

This is used as a kind of buffer time so that packets that have gotten stuck in one or another congested router can still get to the firewall, or to the other end of the connection.

If the connection is reset by a RST packet, the state is changed to CLOSE. This means that the connection per default has 10 seconds before the whole connection is definitely closed down.

RST packets are not acknowledged in any sense, and will break the connection directly. There are also other states than the ones we have discussed so far.

Below is the complete list of possible states that a TCP stream may take, and their timeout values (format is <state>: <timeout>):

NONE: 30 minutes
ESTABLISHED: 5 days
SYN_SENT: 2 minutes
SYN_RECV: 60 seconds
FIN_WAIT: 2 minutes
TIME_WAIT: 2 minutes
CLOSE: 10 seconds
CLOSE_WAIT: 12 hours
LAST_ACK: 30 seconds
LISTEN: 2 minutes

These values are most definitely not absolute. They may change with kernel revisions, and they may also be changed via the proc file-system in the /proc/sys/net/ipv4/netfilter/ip_ct_tcp_* variables.

The default values should, however, be fairly well established in practice. These values are set in seconds.
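Since the kernel stores these timeouts in seconds, it can help to convert the human-readable values from the list. The helper below is a small sketch (the name to_seconds is hypothetical); its results line up with the sample entries shown earlier, where the ESTABLISHED entry counts down from 432000 and the SYN_SENT entry from 120.

```shell
#!/bin/sh
# Convert "<number> <unit>" as used in the timeout list into seconds.
to_seconds() {
    case "$2" in
        second|seconds) echo "$1" ;;
        minute|minutes) echo $(( $1 * 60 )) ;;
        hour|hours)     echo $(( $1 * 3600 )) ;;
        day|days)       echo $(( $1 * 86400 )) ;;
        *) echo "unknown unit: $2" >&2; return 1 ;;
    esac
}

to_seconds 5 days      # ESTABLISHED, prints 432000
to_seconds 12 hours    # CLOSE_WAIT, prints 43200
to_seconds 2 minutes   # SYN_SENT, FIN_WAIT, TIME_WAIT and LISTEN, prints 120
```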

Also note that the userspace side of the state machine does not look at the flags set in the TCP packets (RST, ACK and SYN are such flags). This is generally bad, since we may want to allow packets in the NEW state to get through the firewall, but when we specify the NEW state, we will in most cases mean SYN packets.

This is not what happens with the current state implementation — instead, even a packet with no bit set or an ACK flag, will count as NEW. This can be used for redundant firewalling and so on, but it is generally extremely bad on our home network, where we only have a single firewall.

To get around this behavior, we could use the tcp-window-tracking feature, and set /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_loose to zero, which will make the packet filter drop all NEW packets with anything but the SYN flag set.
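Independently of the sysctl setting, the same effect can be approximated in the ruleset itself by dropping TCP packets that are NEW but do not look like a connection-opening SYN. This is a minimal sketch, not the author's ruleset; the IPT variable and the function name are assumptions, and the commands are only printed unless IPT=iptables is set.

```shell
#!/bin/sh
# Dry-run by default: print the iptables commands instead of executing them.
IPT="${IPT:-echo iptables}"

drop_new_not_syn() {
    # TCP packets that conntrack calls NEW but that are not plain SYN packets.
    $IPT -A INPUT -p tcp ! --syn -m state --state NEW -j DROP
    $IPT -A FORWARD -p tcp ! --syn -m state --state NEW -j DROP
}

drop_new_not_syn
```

Such rules would break connection pickup from another firewall, so they fit the single-firewall case described above rather than redundant setups.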

UDP connections

UDP connections are in themselves not stateful connections, but rather stateless. There are several reasons why, mainly because they do not contain any connection establishment or connection closing — most of all they lack sequencing.

Receiving two UDP datagrams in a specific order does not say anything about the order in which they were sent. It is, however, still possible to set states on the connections within the kernel. Let us have a look at how a connection can be tracked and how it might look in conntrack.

As we can see, the connection is brought up almost exactly in the same way as a TCP connection. That is, from the user point of view.

Internally, conntrack information looks quite a bit different, but intrinsically the details are the same. First of all, let us have a look at the entry after the initial UDP packet has been sent:

udp      17 20 src=192.168.1.2 dst=192.168.1.5 sport=137 dport=1025 [UNREPLIED] src=192.168.1.5 dst=192.168.1.2 sport=1025 dport=137 use=1

As we can see from the first and second values, this is a UDP packet. The first is the protocol name, and the second is the protocol number.

This is just the same as for TCP connections. The third value marks how many seconds this state entry has to live. After this, we get the values of the packet that we have seen and the future expectations of packets over this connection reaching us from the initiating packet sender.

These are the source, destination, source port and destination port. At this point, the [UNREPLIED] flag tells us that there has so far been no response to the packet. Finally, we get a brief list of the expectations for returning packets. Do note that the latter entries are in reverse order to the first values. The timeout at this point is set to 30 seconds, as per default:

udp      17 170 src=192.168.1.2 dst=192.168.1.5 sport=137 dport=1025 src=192.168.1.5 dst=192.168.1.2 sport=1025 dport=137 [ASSURED] use=1

At this point the server has seen a reply to the first packet sent out and the connection is now considered as ESTABLISHED. This is not shown in the connection tracking, as we can see. The main difference is that the [UNREPLIED] flag has now gone.

Moreover, the default timeout has changed to 180 seconds, although in this example it has by now been decremented to 170 seconds; in 10 seconds' time, it will be 160 seconds.

One thing can still change, though, and that is the [ASSURED] flag described above. For the [ASSURED] flag to be set on a tracked connection, there must have been a legitimate reply packet to the NEW packet:

udp      17 175 src=192.168.1.5 dst=195.22.79.2 sport=1025 dport=53 src=195.22.79.2 dst=192.168.1.5 sport=53 dport=1025 [ASSURED] use=1

At this point, the connection has become assured. The connection looks exactly the same as the previous example. If this connection is not used for 180 seconds, it times out. 180 seconds is a comparatively low value, but should be sufficient for most uses. This value is reset to its full value for each packet that matches the same entry and passes through the packet filter, just the same as for all of the internal states.

ICMP connections

ICMP packets are far from a stateful stream, since they are only used for controlling and should never establish any connections.

There are four ICMP types that will generate return packets however, and these have 2 different states. These ICMP messages can take the NEW and ESTABLISHED states. The ICMP types we are talking about are Echo request and Echo reply, Timestamp request and Timestamp reply, Information request and Information reply and finally Address mask request and Address mask reply.

Out of these, the timestamp request and information request are obsolete and could most probably just be dropped. However, the Echo messages are used in several setups such as pinging hosts. Address mask requests are not used often, but could be useful at times and worth allowing. To get an idea of how this could look, have a look at the following image.

As we can see in the above picture, the host sends an echo request to the target, which is considered as NEW by the firewall. The target then responds with an echo reply which the firewall considers as state ESTABLISHED.

When the first echo request has been seen, the following state entry goes into the ip_conntrack.

icmp     1 25 src=192.168.1.6 dst=192.168.1.10 type=8 code=0 id=33029 [UNREPLIED] src=192.168.1.10 dst=192.168.1.6 type=0 code=0 id=33029 use=1

This entry looks a little bit different from the standard states for TCP and UDP as we can see. The protocol is there, and the timeout, as well as source and destination addresses.

The problem comes after that, however. We now have 3 new fields called type, code and id. They are not special in any way: the type field contains the ICMP type and the code field contains the ICMP code. The final id field contains the ICMP ID.

Each ICMP packet gets an ID set to it when it is sent, and when the receiver gets the ICMP message, it sets the same ID within the new ICMP message so that the sender will recognize the reply and will be able to connect it with the correct ICMP request.

The next field, we once again recognize as the [UNREPLIED] flag, which we have seen before. Just as before, this flag tells us that we are currently looking at a connection tracking entry that has seen only traffic in one direction.

Finally, we see the reply expectation for the reply ICMP packet, which is the inversion of the original source and destination IP addresses. As for the type and code, these are changed to the correct values for the return packet, so an echo request is changed to echo reply and so on. The ICMP ID is preserved from the request packet.
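The type inversion in the reply expectation follows the standard ICMP type numbers, as in the entry above where type=8 is answered by type=0. The helper below is a small illustrative sketch; the function name icmp_reply_type is made up.

```shell
#!/bin/sh
# Map an ICMP request type to the reply type expected in return.
icmp_reply_type() {
    case "$1" in
        8)  echo 0  ;;   # Echo request         -> Echo reply
        13) echo 14 ;;   # Timestamp request    -> Timestamp reply
        15) echo 16 ;;   # Information request  -> Information reply
        17) echo 18 ;;   # Address mask request -> Address mask reply
        *)  return 1 ;;  # not a request type that generates a reply
    esac
}

icmp_reply_type 8    # prints 0
```

Any other ICMP type, such as the error messages, yields a non-zero exit status because no reply is expected for it.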

The reply packet is considered as being ESTABLISHED, as we have already explained. However, we can know for sure that after the ICMP reply, there will be absolutely no more legal traffic in the same connection. For this reason, the connection tracking entry is destroyed once the reply has traveled all the way through the netfilter structure.

In each of the above cases, the request is considered as NEW, while the reply is considered as ESTABLISHED. Let us consider this more closely. When the firewall sees a request packet, it considers it as NEW. When the host sends a reply packet to the request it is considered ESTABLISHED.

Note that this means that the reply packet must match the criterion given by the connection tracking entry to be considered as established, just as with all other traffic types.

ICMP requests have a default timeout of 30 seconds, which we can change in the /proc/sys/net/ipv4/netfilter/ip_ct_icmp_timeout entry. This should in general be a good timeout value, since it will be able to catch most packets in transit.

Another hugely important part of ICMP is the fact that it is used to tell the hosts what happened to specific UDP and TCP connections or connection attempts.

For this simple reason, ICMP replies will very often be recognized as RELATED to original connections or connection attempts. A simple example would be the ICMP Host unreachable or ICMP Network unreachable. These should always be sent back to our host if it attempts a connection to some other host while the network or host in question is down; the last router trying to reach the site in question will then reply with an ICMP message telling us about it. In this case, the ICMP reply is considered as a RELATED packet. The following picture should explain how it would look.

In the above example, we send out a SYN packet to a specific address. This is considered as a NEW connection by the packet filter. However, the network the packet is trying to reach is unreachable, so a router returns a network unreachable ICMP error to us.

The connection tracking code can recognize this packet as RELATED, thanks to the already added tracking entry, so the ICMP reply is correctly sent to the client which will then hopefully abort. Meanwhile, the firewall has destroyed the connection tracking entry since it knows this was an error message.

The same behavior as above is experienced with UDP connections if they run into any problem like the above. All ICMP messages sent in reply to UDP connections are considered as RELATED. Consider the following image.

This time a UDP packet is sent to the host. This UDP connection is considered as NEW. However, the network is administratively prohibited by some firewall or router on the way over.

Hence, our packet filter receives an ICMP Network Prohibited in return. The packet filter knows that this ICMP error message is related to the already opened UDP connection and sends it as a RELATED packet to the client.

At this point, the packet filter destroys the connection tracking entry, and the client receives the ICMP message and should hopefully abort.
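The NEW, ESTABLISHED and RELATED classifications described above are what stateful rules match on. As a minimal sketch, assuming the conntrack match (-m conntrack --ctstate) is available and that we only want to allow outgoing pings plus whatever replies and ICMP errors belong to connections we opened ourselves, the rules could look like this. The run wrapper, the APPLY variable and the function name are hypothetical helpers for this example; by default the commands are only printed, not applied.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless APPLY=1 is
# set, in which case the command is really executed (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

stateful_icmp_rules() {
    # Echo replies (ESTABLISHED) and ICMP errors that belong to a
    # tracked TCP/UDP connection (RELATED) are let back in.
    run iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # An outgoing echo request creates the NEW conntrack entry.
    run iptables -A OUTPUT -p icmp --icmp-type echo-request -m conntrack --ctstate NEW -j ACCEPT
    # Later packets of already tracked outgoing connections.
    run iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
}

stateful_icmp_rules
```

Running the script as is prints the three commands for review; APPLY=1 would hand them to iptables.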

Default connections

In certain cases, the conntrack machine does not know how to handle a specific protocol. This happens if it does not know about that protocol in particular, or does not know how it works.

In these cases, it goes back to a default behavior. The default behavior is used on, for example, NETBLT, MUX and EGP.

This behavior looks pretty much the same as the UDP connection tracking. The first packet is considered NEW, and reply traffic and so forth is considered ESTABLISHED.

When the default behavior is used, all of these packets will attain the same default timeout value. This can be set via the /proc/sys/net/ipv4/netfilter/ip_ct_generic_timeout variable.

The default value here is 600 seconds, or 10 minutes. Depending on what traffic we are trying to send over a link that uses the default connection tracking behavior, this might need changing, especially if we are bouncing traffic through satellites and such, which can take a long time.
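The timeout entries mentioned so far (ip_ct_icmp_timeout and ip_ct_generic_timeout) can be inspected with a few lines of shell. The sketch below is only an illustration: the function name ct_timeout is made up for this example, and the second path it tries is an assumption about newer kernels, where, as far as we know, the same values are exposed as nf_conntrack_*_timeout under /proc/sys/net/netfilter/.

```shell
#!/bin/sh
# Print the conntrack timeout (in seconds) for a protocol name such as
# "icmp" or "generic". The path used in the text is tried first, then
# the nf_conntrack path (assumed for newer kernels). Prints
# "unavailable" if neither file is readable, for example when the
# conntrack module is not loaded.
ct_timeout() {
    for f in "/proc/sys/net/ipv4/netfilter/ip_ct_${1}_timeout" \
             "/proc/sys/net/netfilter/nf_conntrack_${1}_timeout"; do
        if [ -r "$f" ]; then
            cat "$f"
            return 0
        fi
    done
    echo "unavailable"
}

echo "icmp:    $(ct_timeout icmp)"
echo "generic: $(ct_timeout generic)"

# Changing a value needs root; for a slow link the generic timeout
# could be raised to 20 minutes by writing to whichever file exists:
#   echo 1200 > /proc/sys/net/ipv4/netfilter/ip_ct_generic_timeout
```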

Untracked connections and the raw table

UNTRACKED is a rather special keyword when it comes to connection tracking in Linux. Basically, it is used to match packets that have been marked in the raw table not to be tracked.

The raw table was created specifically for this reason. In this table, we set a NOTRACK mark on packets that we do not wish to track in netfilter.

Notice how I say packets, not connection, since the mark is actually set for each and every packet that enters. Otherwise, we would still have to do some kind of tracking of the connection to know that it should not be tracked.

As we have already stated, conntrack and the state machine are rather resource hungry. For this reason, it might sometimes be a good idea to turn off connection tracking and the state machine.

One example would be a heavily trafficked router on which we want to firewall the traffic to and from the router itself, but not the routed traffic.

We could then set the NOTRACK mark on all packets not destined for the firewall itself by ACCEPTing all packets with destination of our host in the raw table, and then set the NOTRACK for all other traffic.

This would then allow us to have stateful matching on incoming traffic for the router itself, but at the same time save processing power from not handling all the crossing traffic.

Another example when NOTRACK can be used is if we have a highly trafficked web server and want to do stateful tracking, but do not want to waste processing power on tracking the web traffic.

We could then set up a rule that turns off tracking for port 80 on all the locally owned IP addresses, or the ones that are actually serving web traffic.

We could then enjoy stateful tracking on all other services, except for web traffic, which might save some processing power on an already overloaded system.
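The web server case can be sketched as follows. This is an illustration under assumptions, not a tested rule set: it uses the NOTRACK target named in the text (newer iptables releases may prefer the CT target with --notrack instead), it assumes the web server listens on TCP port 80 only, and the run wrapper, APPLY variable and function name are hypothetical helpers that print the commands instead of applying them.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless APPLY=1 is
# set, in which case the command is really executed (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

notrack_web_rules() {
    # raw table: exempt web traffic from connection tracking, in both
    # directions, since the mark is set per packet.
    run iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
    run iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK
    # filter table: untracked packets carry no NEW/ESTABLISHED state,
    # so they have to be accepted explicitly.
    run iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate UNTRACKED -j ACCEPT
    run iptables -A OUTPUT -p tcp --sport 80 -m conntrack --ctstate UNTRACKED -j ACCEPT
    # All other services still get stateful treatment.
    run iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
}

notrack_web_rules
```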

There are however some problems with NOTRACK that we must take into consideration. If a whole connection is set with NOTRACK, then we will not be able to track related connections either. Conntrack and NAT helpers will simply not work for untracked connections, nor will related ICMP errors be recognized, i.e. we will have to open up for these manually.

When it comes to complex protocols such as FTP and SCTP etc., this can be very hard to manage. As long as we are aware of this, we should be able to handle it however.

Complex protocols and connection tracking

Certain protocols are more complex than others. What this means when it comes to connection tracking, is that such protocols may be harder to track correctly.

Good examples of these are the ICQ, IRC and FTP protocols. Each and every one of these protocols carries information within the actual data payload of the packets, and hence requires special connection tracking helpers to enable it to function correctly.

The complex protocols that have support inside the Linux kernel are FTP (File Transfer Protocol), IRC (Internet Relay Chat) and TFTP (Trivial File Transfer Protocol).

Let us take the FTP protocol as the first example. The FTP protocol first opens up a single connection that is called the FTP control session.

When we issue commands through this session, other ports are opened to carry the rest of the data related to that specific command. These connections can be done in two ways, either actively or passively.

When a connection is done actively, the FTP client sends the server a port and IP address to connect to. After this, the FTP client opens up the port and the server connects to that specified port, normally from its own port 20 (the FTP-data port), and sends the data over it.

The problem here is that the packet filter will not know about these extra connections, since they were negotiated within the actual payload of the protocol data.

Because of this, the firewall will be unable to know that it should let the server connect to the client over these specific ports.

The solution to this problem is to add a special helper to the connection tracking module which will scan through the data in the control connection for specific syntaxes and information.

When it runs into the correct information, it will add an expectation for that specific data connection as RELATED, and the server will be able to make the connection thanks to that RELATED entry. Consider the following picture to understand the states when the FTP server has made the connection back to the client.

Passive FTP works the opposite way. The FTP client tells the server that it wants some specific data, upon which the server replies with an IP address to connect to and at what port.

The client will, upon receipt of this data, connect to that specific port from a random unprivileged port of its own (>1023), and get the data in question.

If we have an FTP server behind our packet filter, we will require this module in addition to our standard netfilter modules to let clients on the Internet connect to the FTP server properly. The same goes if we are extremely restrictive with our users, and only want to let them reach HTTP and FTP servers on the Internet and block all other ports. Consider the following image and its bearing on Passive FTP.
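For a machine that runs the FTP server itself, the interplay between the helper and the RELATED state could be sketched like this. It is a minimal example under assumptions: the helper module is taken to be nf_conntrack_ftp, the server is assumed to listen on the standard port 21, and the run wrapper, APPLY variable and function name are hypothetical. On newer kernels the helper may additionally have to be attached explicitly in the raw table (CT target with --helper ftp); that step is not covered by the text and is left out here.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless APPLY=1 is
# set, in which case the command is really executed (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

ftp_server_rules() {
    # Load the helper that scans the FTP control connection.
    run modprobe nf_conntrack_ftp
    # The control connection itself (port 21) starts out as NEW.
    run iptables -A INPUT -p tcp --dport 21 -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT
    # Data connections negotiated inside the control connection are
    # classified as RELATED, whether the server connects out (active)
    # or the client connects in (passive).
    run iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    run iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
}

ftp_server_rules
```

Note that no data port is opened explicitly; without the helper, the data connections would not be RELATED and would have to be allowed by port.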

Some conntrack helpers are already available within the kernel itself. More specifically, the FTP and IRC protocols have conntrack helpers as of writing this. If we cannot find the conntrack helpers that we need within the kernel itself, we should have a look at the xtables-addons package.

The xtables-addons tree may contain more conntrack helpers, such as for the ntalk or H.323 protocols. If they are not available in the xtables-addons tree, we have a number of options.

Either we can look at the CVS source of netfilter, if it has recently gone into that tree, or we can contact the netfilter-devel mailing list and ask if it is available.

Conntrack helpers may either be statically compiled into the kernel, or be available as kernel modules:

wks:/home/sa# modprobe -l *conntrack* | xargs -I {} basename {}
nf_conntrack_proto_sctp.ko
nf_conntrack_netlink.ko
nf_conntrack_h323.ko
nf_conntrack_tftp.ko
nf_conntrack_irc.ko
xt_conntrack.ko
nf_conntrack_ftp.ko
nf_conntrack_netbios_ns.ko
nf_conntrack_amanda.ko
nf_conntrack_sane.ko
nf_conntrack_proto_udplite.ko
nf_conntrack_pptp.ko
nf_conntrack.ko
nf_conntrack_proto_gre.ko
nf_conntrack_sip.ko
nf_conntrack_proto_dccp.ko
nf_conntrack_ipv6.ko
nf_conntrack_ipv4.ko
wks:/home/sa#

If they are compiled as modules, we can load them using modprobe as shown below:

wks:/home/sa# modprobe nf_conntrack_ftp
wks:/home/sa# lsmod | grep conntrack
nf_conntrack_amanda     8832  0
nf_conntrack_irc       10680  0
nf_conntrack_ftp       12728  0
nf_conntrack           82688  3 nf_conntrack_amanda,nf_conntrack_irc,nf_conntrack_ftp
wks:/home/sa#

Do note that the conntrack helpers only track connections; they do not rewrite addresses, and hence we may require more modules if we are NAT'ing connections as well.

For example, if we want to NAT and track FTP connections, we would need the NAT module as well. As of now (April 2009), all NAT helpers start with nf_nat_ and follow that naming convention i.e. the FTP NAT helper would be named nf_nat_ftp and the IRC module would be named nf_nat_irc. The conntrack helpers follow the same naming convention, and hence the IRC conntrack helper would be named nf_conntrack_irc, while the FTP conntrack helper would be named nf_conntrack_ftp.

Tables / Chains / Rules

The xtables framework, used by the modules ip_tables, ip6_tables, arp_tables and ebtables, allows us to define tables containing chains of rules for the treatment of packets.

The tables we know are filter, nat, mangle and raw. Each table is associated with a different kind of packet processing. Packets are processed by traversing the chains, rule by rule. A rule in a chain can send a packet to another chain, and this can be repeated to whatever level of nesting is desired.

Every network packet arriving at or leaving from the computer traverses at least one chain. The source of the packet determines which chain it traverses initially.

There are three predefined chains (INPUT, OUTPUT, and FORWARD) in the filter table. Predefined chains have a default policy, for example DROP, which is applied to the packet if it reaches the end of the chain.

However, we can create as many other custom chains as desired. These chains have no default policy i.e. if a packet reaches the end of the chain it is returned to the chain which called it. A chain may be empty.

Each rule in a chain contains the specification of which packets it matches. It may also contain a target. As a packet traverses a chain, each rule in turn examines it. If a rule does not match the packet, the packet is passed on to the next rule.

If a rule does match the packet, the rule takes the action indicated by the target, which may result in the packet being allowed to continue along the chain or not. The packet continues to traverse the chain until either

a rule matches the packet and decides the ultimate fate of the packet, e.g. by calling a target such as ACCEPT, DROP or QUEUE, or
a rule calls the RETURN target, in which case processing returns to the calling chain or
the end of the chain is reached

Packet Flow and Relationship between Tables and Chains

It is now time to take a look at the actual packet flow and how the pieces (tables and chains) actually fit together.

When packets travel/traverse through a packet filter there is a certain order applied to that process. The image below is going to help us get to grips with the whole routing/filtering/mangling/nating shebang that may happen to an IP packet as it traverses through a packet filter.

When an IP packet first enters the packet filter, it hits the hardware and then gets passed on to the proper device driver in the kernel.

Then the packet starts to travel through a series of steps in the Linux kernel, before it is either sent to the correct application (i.e. a local process) running on the local machine, or it is forwarded to another machine.

Let us now take a look at the three major cases that might happen with regards to packet filtering/routing:

inbound i.e. Source: non-local (e.g. Internet), Destination: local process
outbound i.e. Source: local process, Destination: outgoing (e.g. to the Internet)
forwarding i.e. Source: non-local, Destination: non-local

Inbound: Source = non-local, Destination = Local Process

First, let us have a look at a packet that is destined for our local machine i.e. a local process. It would pass through the following steps before it is actually delivered to the application that receives it:

On the wire e.g. Internet.
Comes in on the interface e.g. eth0.
table: raw, chain: PREROUTING. This chain is used to handle packets before the connection tracking takes place. It can be used to set a specific connection not to be handled by the connection tracking code for example.
Next the connection tracking takes place.
table: mangle, chain: PREROUTING. This chain is normally used for mangling packets, e.g. changing TOS (Type of Service) and so on.
table: nat, chain: PREROUTING. This chain is used for DNAT mainly. We should avoid filtering in this chain since it will be bypassed in certain cases.
Next a routing decision takes place, i.e. is the packet destined for our local machine or to be forwarded. It is also important to note, that part of concluding a routing decision is doing ingress filtering and/or making QoS (Quality of Service) routing decisions.
table: mangle, chain: INPUT. At this point, the mangle INPUT chain is hit. We use this chain to mangle packets, after they have been routed, but before they are actually sent to the local process on our local machine.
table: filter, chain: INPUT. This is where we do filtering for all incoming traffic destined for our local machine. Note that all incoming packets destined for this machine pass through this chain, no matter what interface or in which direction they came from.
Local process receives the IP packet.

Outbound: Source = Local Process, Destination = Outgoing

Now we look at the outgoing packets from our own local host and what steps they go through.

Local process creates an IP packet, sends it to the Linux kernel network stack for processing.
The IP packet is given a source address, the outgoing interface to use, and other necessary information that needs to be gathered. Based upon that information, a routing decision is made.
table: raw, chain: OUTPUT. This is where we do work before the connection tracking takes place for locally generated packets — we can mark connections so that they will not be tracked for example.
This is where the connection tracking takes place for locally generated packets, for example state changes etc.
table: mangle, chain: OUTPUT. This is where we mangle packets, it is suggested that we do not filter in this chain since it can have side effects.
table: nat, chain: OUTPUT. This chain can be used to NAT outgoing packets from the firewall itself.
Routing decision, since the previous mangle and nat changes may have changed how the packet should be routed.
table: filter, chain: OUTPUT. This is where we filter packets going out from the local machine.
table: mangle, chain: POSTROUTING. The POSTROUTING chain in the mangle table is mainly used when we want to do mangling on packets before they leave our machine, but after the actual routing took place. This chain will be hit by both packets just traversing the firewall, as well as packets created by the firewall itself.
table: nat, chain: POSTROUTING. This is where we do SNAT. It is suggested that we do not do filtering here since it can have side effects, and certain packets might slip through even though we set a default policy of DROP.
The IP packet goes out on some interface e.g. eth0. We might also do egress filtering at this point which certainly is a good idea in case our firewall is the gateway for one or several LANs.
It is now on the wire e.g. Internet.

Forwarding: Source = non-local, Destination = non-local

Now we are assuming that the packet is destined for another machine on another network. The packet goes through the different steps in the following fashion:

On the wire e.g. Internet.
IP packet comes in on the interface e.g. eth0.
table: raw, chain: PREROUTING. Here we can set a connection to not be handled by the connection tracking system.
This is where the non-locally generated connection tracking takes place.
table: mangle, chain: PREROUTING. This chain is normally used for mangling packets e.g. changing TOS and so on.
table: nat, chain: PREROUTING. This chain is used for DNAT mainly. SNAT is done further on. We should avoid filtering in this chain since it will be bypassed in certain cases.
Routing decision takes place i.e. is the packet destined for our local machine or to be forwarded and where. It is also important to note, that part of concluding a routing decision is doing ingress filtering and/or making QoS (Quality of Service) routing decisions.
table: mangle, chain: FORWARD. The packet is then sent on to the FORWARD chain of the mangle table. This can be used for very specific needs, where we want to mangle the packets after the initial routing decision, but before the last routing decision is made, just before the packet is sent out.
table: filter, chain: FORWARD. The packet gets routed onto the FORWARD chain. Only forwarded packets go through here, and here we do all the filtering. Note that all traffic that is forwarded goes through here (i.e. not only incoming), so we need to think about it when writing our rule-set. We should not use the INPUT chain to filter on in the current forwarding scenario! INPUT is meant solely for packets to our local machine that do not get routed to any other destination.
table: mangle, chain: POSTROUTING. This chain is used for specific types of packet mangling that we wish to take place after all kinds of routing decisions have been made, but still on this machine.
table: nat, chain: POSTROUTING. This chain should first and foremost be used for SNAT. Again, we should avoid doing filtering here, since certain packets might never hit this chain. This is also where masquerading is done.
Finally the IP packet goes out on the outgoing interface e.g. eth1. We might also do egress filtering at this point which certainly is a good idea in case our firewall is the gateway for one or several LANs.
Out on the wire again e.g. LAN.

As we can see, there are quite a lot of steps to pass through. The packet can be stopped at any of the iptables chains, or anywhere else if it is malformed. However, we are mainly interested in the packet filtering aspect right now.

Do note that there are no specific chains or tables for different interfaces or anything like that. FORWARD is always passed by all packets that are forwarded over this firewall/router.
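As a compact recap of the three traversal lists above, the following sketch prints the table:chain pairs in the order given in the text. It only restates those lists and does not query the kernel; the function name packet_path is made up for this example, and routing decisions, connection tracking and the interfaces themselves are deliberately left out.

```shell
#!/bin/sh
# Print the table:chain pairs a packet passes, in order, for one of the
# three cases described in the text: inbound, outbound or forwarding.
packet_path() {
    case "$1" in
        inbound)
            echo "raw:PREROUTING mangle:PREROUTING nat:PREROUTING mangle:INPUT filter:INPUT" ;;
        outbound)
            echo "raw:OUTPUT mangle:OUTPUT nat:OUTPUT filter:OUTPUT mangle:POSTROUTING nat:POSTROUTING" ;;
        forwarding)
            echo "raw:PREROUTING mangle:PREROUTING nat:PREROUTING mangle:FORWARD filter:FORWARD mangle:POSTROUTING nat:POSTROUTING" ;;
        *)
            echo "usage: packet_path inbound|outbound|forwarding" >&2
            return 1 ;;
    esac
}

for c in inbound outbound forwarding; do
    printf '%-11s %s\n' "$c" "$(packet_path "$c")"
done
```

Seen side by side, it becomes obvious that a forwarded packet never touches INPUT or OUTPUT, and that the filter table is consulted exactly once in each case.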

The Routing Tables in Detail

Now is the right time to take a closer look at the four tables (raw, mangle, nat and filter), why they exist, what they do in particular and what their relationship is.

raw

The raw table and its chains are used before any other tables in netfilter. For this table to work, the iptable_raw module must be loaded. It will be loaded automatically if iptables is run with the -t raw keyword, and if the module is available.

raw is mainly used for one thing, and that is to set a mark on packets so that they are not handled by the connection tracking system. This is done by using the NOTRACK target on the packet. If a connection is hit with the NOTRACK target, then conntrack will simply not track the connection.

This was impossible to solve without adding a new table, since none of the other tables are called until after conntrack has already run on the packets and has either added them to the conntrack tables or matched them against an already known connection.

This table is rather new and is only available, if compiled in, with late 2.6 kernels and later. The raw table contains two chains, PREROUTING and OUTPUT, which handle IP packets before they hit any of the other netfilter subsystems.

The PREROUTING chain can be used for all incoming packets to this machine, or that are forwarded, while the OUTPUT chain can be used to alter the locally generated packets before they hit any of the other netfilter subsystems.

mangle

This table is used mainly for mangling packets (by using mangle targets). Among other things, we can change the contents of different packets and that of their headers. We are strongly advised not to use this table for any filtering — nor will any DNAT, SNAT or masquerading work in this table.

Examples of this would be to change the TTL (Time to Live), TOS or MARK. Note that the MARK is not really a change to the packet, but a mark value for the packet is set in kernelspace. Other rules or programs might use this mark further along in the firewall to filter or do advanced routing on — tc or ip are examples.

The table consists of five built-in chains namely PREROUTING, POSTROUTING, OUTPUT, INPUT and FORWARD:

PREROUTING is used for altering packets just as they enter the packet filter and before they hit the routing decision.

POSTROUTING is used to mangle packets just after all routing decisions have been made.

OUTPUT is used for altering locally generated packets after they enter the routing decision.

INPUT is used to alter packets after they have been routed to the local computer itself, but before the userspace application actually sees the data.

FORWARD is used to mangle packets after they have hit the first routing decision, but before they actually hit the last routing decision.

Note that mangle cannot be used for any kind of NAT or masquerading; the nat table was made for these kinds of operations.
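The three kinds of mangling mentioned above (TOS, TTL and MARK) could look roughly like this in practice. The ports, the interface eth0, the mark value 2 and the routing table number 20 are arbitrary values picked for illustration, and the run wrapper, APPLY variable and function name are hypothetical helpers that print the commands instead of applying them.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless APPLY=1 is
# set, in which case the command is really executed (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

mangle_examples() {
    # TOS: ask for low delay on locally generated SSH traffic.
    run iptables -t mangle -A OUTPUT -p tcp --dport 22 -j TOS --set-tos Minimize-Delay
    # TTL: normalize the TTL of everything entering on eth0.
    run iptables -t mangle -A PREROUTING -i eth0 -j TTL --ttl-set 64
    # MARK: tag SMTP traffic; the mark lives in kernelspace only and
    # does not change the packet itself.
    run iptables -t mangle -A PREROUTING -p tcp --dport 25 -j MARK --set-mark 2
    # ip can then route on that mark (routing table 20 is assumed to
    # have been populated elsewhere).
    run ip rule add fwmark 2 table 20
}

mangle_examples
```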

nat

The nat table is used mainly for NAT. NATed packets get their IP addresses altered, according to our rules. Packets in a stream only traverse this table once.

We assume that the first packet of a stream is allowed. The rest of the packets in the same stream are then automatically NATed, and will be subject to the same actions as the first packet.

These will, in other words, not go through this table again, but will nevertheless be treated like the first packet in the stream. This is the main reason why we should not do any filtering in this table.

The PREROUTING chain is used to alter packets as soon as they come into the packet filter. The OUTPUT chain is used for altering locally generated packets (i.e. on the packet filter itself) before they get to the routing decision.

Finally we have the POSTROUTING chain which is used to alter packets just as they are about to leave the firewall.
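A minimal sketch of how the nat chains are typically used follows. It assumes eth0 is the outside interface and 10.0.3.0/24 the internal network; the internal web server address 10.0.3.10 is invented for this example, as are the run wrapper, the APPLY variable and the function name. In line with the advice above, the actual permit decision is made in the filter table, not in nat.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless APPLY=1 is
# set, in which case the command is really executed (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

nat_examples() {
    # PREROUTING: DNAT incoming web traffic to an internal server.
    run iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to-destination 10.0.3.10:80
    # POSTROUTING: SNAT traffic from the internal network on its way out.
    run iptables -t nat -A POSTROUTING -s 10.0.3.0/24 -o eth0 -j SNAT --to-source 10.0.0.34
    # Filtering stays in the filter table: allow the DNATed traffic.
    run iptables -A FORWARD -i eth0 -p tcp -d 10.0.3.10 --dport 80 -j ACCEPT
}

nat_examples
```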

filter

The filter table should be used exclusively for filtering packets. For example, we could DROP, LOG, ACCEPT or REJECT packets without problems, as we can in the other tables.

With the filter table, we can match packets and filter them in whatever way we want. This is the place that we actually take action against packets and look at what they contain and either deny or permit them, depending on their content.

Almost all targets are usable in this table — this table is the right place to do your main filtering.

There are three chains built in to this table. The first one is named FORWARD and is used on all non-locally generated packets that are not destined for our local machine. INPUT is used on all packets that are destined for our local machine (the packet filter itself) and OUTPUT is finally used for all locally generated packets.
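To make the role of the filter table more concrete, here is a hedged baseline for a single host. It assumes SSH on port 22 is the only service offered; the log prefix, the run wrapper, the APPLY variable and the function name are made up for this example. The policies are set last so that applying the rules over a remote session does not cut off the session before the ACCEPT rules exist.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless APPLY=1 is
# set, in which case the command is really executed (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

filter_baseline() {
    # Loopback traffic and already tracked connections.
    run iptables -A INPUT -i lo -j ACCEPT
    run iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # The one service we offer.
    run iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j ACCEPT
    # LOG is non-terminating: the packet continues to the next rule.
    run iptables -A INPUT -j LOG --log-prefix INPUT-drop:
    # Be polite to TCP clients; everything else falls through to the policy.
    run iptables -A INPUT -p tcp -j REJECT --reject-with tcp-reset
    # Default policies of the three built-in chains.
    run iptables -P INPUT DROP
    run iptables -P FORWARD DROP
    run iptables -P OUTPUT ACCEPT
}

filter_baseline
```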

User-specified chains

If an IP packet enters a chain such as the INPUT chain in the filter table, we can specify a jump rule to a different chain within the same table.

The new chain must be user-specified; it may not be a built-in chain such as the INPUT or FORWARD chain. If we consider a pointer pointing at the rule in the chain to execute, the pointer will go down, rule by rule, from top to bottom until the chain traversal is either ended by a target or the main chain (i.e. FORWARD, INPUT, etc.) ends. If the end of the built-in chain is reached, its default policy is applied.

If one of the rules that matches points to another user-specified chain in the jump specification, the pointer will jump over to this chain and then start traversing that chain from the top to bottom.

For example, see how the rule execution jumps from rule number 3 to chain 2 in the above image. The IP packet matched the matches contained in rule 3, and the jump/target specification was set to send the packet on for further examination in chain 2.

User-specified chains cannot have a default policy i.e. -P <name_of_user-specified_chain> DROP for example will not work. Only built-in chains (FORWARD, OUTPUT, INPUT, PREROUTING and POSTROUTING) have default policies.

However, this can be circumvented by appending a single rule at the end of the user-specified chain which has no matches and hence behaves like a default policy. Best practice, however, is to really only use default policies with built-in chains!

If no rule is matched in a user-specified chain, the default behavior is to jump back to the originating chain as can be seen in the image above i.e. the rule execution jumps from chain 2 and back to chain 1 rule 4, below the rule that sent the rule execution into chain 2 to begin with.

Each and every rule in the user-specified chain is traversed until either one of the rules matches or the end of the chain is reached. If we have a match before the end of the chain is reached, the target specifies if the traversing should end or continue.

If the end of the user-specified chain is reached, the packet is sent back to the invoking chain. The invoking chain can be either a user-specified chain or a built-in chain i.e. user-specified chains can be nested.
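The behavior of user-specified chains can be tried out with a small example. The chain name ssh_in and the address ranges are invented for illustration, as are the run wrapper, the APPLY variable and the function name; the sketch shows a jump from INPUT, an explicit RETURN, and a catch-all rule standing in for the default policy that a user-specified chain cannot have.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints each command unless APPLY=1 is
# set, in which case the command is really executed (needs root).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

user_chain_demo() {
    # Create the chain and send SSH traffic from INPUT into it.
    run iptables -N ssh_in
    run iptables -A INPUT -p tcp --dport 22 -j ssh_in
    # Terminating target: traversal ends here for the local LAN.
    run iptables -A ssh_in -s 192.168.1.0/24 -j ACCEPT
    # RETURN: hand the packet back to INPUT, which continues with the
    # rule following the jump.
    run iptables -A ssh_in -s 10.0.0.0/8 -j RETURN
    # Rule without matches: acts like a default policy for this chain.
    run iptables -A ssh_in -j DROP
}

user_chain_demo
```

Without the final DROP rule, packets from any other source would simply fall off the end of ssh_in and continue in INPUT, exactly like the ones that hit RETURN.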

Rules

A rule could be described as the directions the packet filter will adhere to when blocking or permitting different connections and packets in a specific chain.

Each line we write that is inserted in a chain should be considered a rule. We will also discuss the basic matches that are available, and how to use them, as well as the different targets and how we can construct new targets of our own i.e. by creating user-specified chains.

Closer Look at Rules

Each rule is a line that the kernel looks at to find out what to do with an IP packet. If all the matches are met, the target/jump instruction is performed.

Normally we would write our rules in a syntax that looks something like this: iptables [-t table] command [match] [target/jump] like for example

iptables -t nat -A POSTROUTING -s 10.0.3.0/24 -o eth0 -j SNAT --to 10.0.0.34
         ^^^^^^ ^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^
         table  command        match                  target/jump

There is nothing that says that the target/jump instruction has to be the last function in the line. However, we would usually adhere to this convention to get the best readability — most of the rules we will see are written in this way. Hence, if we read someone else's script, we will most likely recognize the syntax and easily understand the rule.

If we want to use a table other than the standard table, we can insert the table specification at the point at which [table] is specified. However, it is not necessary to state explicitly what table to use, since by default iptables uses the filter table on which to implement all commands.

Neither do we have to specify the table at just this point in the rule. It could be set pretty much anywhere along the line. However, it is more or less convention to put the table specification at the beginning.

One more thing to keep in mind, though, is that the command should always come first, or alternatively directly after the table specification. We use command to tell iptables what to do, for example to insert (-I) a rule or to add (-A) a rule to the end of the chain, or to delete (-D) a rule.

The match is the part of the rule that we send to the kernel that details the specific character of the packet — what makes it different from all other packets. Here we could specify what IP address the packet comes from, from which network interface, the intended IP address, port, protocol or whatever. There is a heap of different matches that we can use.

Finally we have the target/jump of the packet. If all the matches are met for a packet, we tell the kernel what to do with it. We could, for example, tell the kernel to send the packet to a user-specified chain that we have created ourselves, and which is part of this particular table. We could also tell the kernel to drop the packet and do no further processing, or we could tell the kernel to send a specified reply to the sender.

Example Commands

The command tells iptables what to do with the rest of the rule. Normally we would want either to add or delete something in some table or another. The following commands are available to iptables:

-A or --append
- Example: iptables -A INPUT...
- Explanation: This command appends the rule to the end of the chain. The rule will in other words always be put last in the rule-set and hence be checked last, unless we append more rules later on.
-D or --delete
- Example: iptables -D INPUT -p tcp --dport 80 -j DROP or iptables -D INPUT 1.
- Explanation: This command deletes a rule in a chain. This could be done in two ways. Either by entering the whole rule to match (as in the first example), or by specifying the rule number that we want to match. If we use the first method, our entry must match the entry in the chain exactly. If we use the second method, we must match the number of the rule we want to delete. The rules are numbered from the top of each chain, starting with number 1.
-R or --replace
- Example: iptables -R INPUT 1 -s 192.168.0.1 -j DROP.
- Explanation: This command replaces the old entry at the specified line. It works in the same way as the --delete command, but instead of totally deleting the entry, it will replace it with a new entry. The main use for this might be while we are still experimenting with iptables.
-I or --insert
- Example: iptables -I INPUT 1 -p tcp --dport 80 -j ACCEPT.
- Explanation: Insert a rule somewhere in a chain. The rule is inserted as the actual number that we specify. In other words, the above example would be inserted as rule 1 in the INPUT chain, and hence from now on it would be the very first rule in the chain. Hint: see --line-numbers and --list to get a numbered list of rules.
-L or --list
- Example: iptables -L INPUT.
- Explanation: This command lists all the entries/rules in the specified chain. In the above case, we would list all the entries in the INPUT chain. It is also legal to not specify any chain at all. In that case, the command would list all the chains in the specified table. The exact output is affected by other options, for example -n and -v.
-S or --list-rules
- Example: iptables -S INPUT
- Explanation: Print all rules in the selected chain e.g. INPUT. If no chain is selected, all chains are printed like with iptables-save. Like every other iptables command, it applies to the specified table (filter is the default).
-F or --flush
- Example: iptables -F INPUT.
- Explanation: This command flushes all rules from the specified chain and is equivalent to deleting each rule one by one, but is quite a bit faster. The command can be used without options, and will then delete all rules in all chains within the specified table.
-Z or --zero
- Example: iptables -Z INPUT.
- Explanation: This command tells the program to zero all counters in a specific chain, or in all chains. If we have used the -v option with the -L command, we have probably seen the packet counter at the beginning of each line. To zero this packet counter, we use the -Z option. This option works the same as -L, except that -Z will not list the rules. If -L and -Z are used together, the chains will first be listed, and then the packet counters are zeroed.

-N or --new-chain
- Example: iptables -N allowed.
- Explanation: This command tells the kernel to create a new chain of the specified name in the specified table. In the above example we create a chain called allowed. Note that there must not already be a chain or target of the same name.
-X or --delete-chain
- Example: iptables -X allowed.
- Explanation: This command deletes the specified chain from the table. For this command to work, there must be no rules that refer to the chain that is to be deleted. In other words, we would have to replace or delete all rules referring to the chain before actually deleting the chain. If this command is used without any options, all chains but the built-in ones to the specified table will be deleted.
-P or --policy
- Example: iptables -P INPUT DROP.
- Explanation: This command tells the kernel to set a specified default target, or policy, on a chain. All packets that do not match any rule will then be forced to use the policy of the chain. Legal targets are DROP and ACCEPT.
-E or --rename-chain
- Example: iptables -E <old_name> <new_name>.
- Explanation: The -E command tells iptables to change the name of the chain old_name to new_name. Note that this will not affect the actual way the table will work. It is, in other words, just a cosmetic change to the table.
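
The rule numbering used by the commands above can be pictured with a toy model. The following is only a sketch, not how iptables is implemented: it treats one chain as a plain Python list of rule strings, and the helper names (append_rule, insert_rule and so on) are made up for illustration.

```python
# Toy model of one chain: a plain Python list of rule strings.
# iptables numbers rules from 1, so rule number n lives at list index n - 1.

def append_rule(chain, rule):
    """-A: put the rule at the end of the chain."""
    chain.append(rule)

def insert_rule(chain, number, rule):
    """-I: make the rule the one with the given (1-based) number."""
    chain.insert(number - 1, rule)

def replace_rule(chain, number, rule):
    """-R: overwrite the rule at the given (1-based) number."""
    chain[number - 1] = rule

def delete_rule(chain, number=None, rule=None):
    """-D: delete by rule number, or by an exactly matching rule."""
    if number is not None:
        del chain[number - 1]
    else:
        chain.remove(rule)  # raises ValueError if no identical rule exists

def flush(chain):
    """-F: remove every rule from the chain."""
    chain.clear()

input_chain = []
append_rule(input_chain, "-p tcp --dport 22 -j ACCEPT")
append_rule(input_chain, "-s 192.168.0.1 -j DROP")
insert_rule(input_chain, 1, "-p tcp --dport 80 -j ACCEPT")    # becomes rule 1
replace_rule(input_chain, 2, "-p tcp --dport 2222 -j ACCEPT")  # old rule 1 moved to 2
delete_rule(input_chain, rule="-s 192.168.0.1 -j DROP")        # delete by exact match

for number, rule in enumerate(input_chain, start=1):
    print(number, rule)
```

Note how inserting at position 1 shifts every existing rule down by one, which is why a rule number that was valid a moment ago may point at a different rule after an insert or delete.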

We should always enter a complete command line, unless we just want to list the built-in help for iptables or get the version of the command. To get the version, -V (upper case) can be used, since the lower case -v is the verbose option. To get the help message, -h is the way to go.

Matches

A match is something that specifies a special condition within the packet that must be true (or false). A single rule can contain several matches of any kind.

For example, we may want to match packets that come from a specific host located within our LAN, and (logical AND) on top of that only from specific ports on that host. We could then use matches to tell the rule to only apply the target (or jump specification) to packets that have a specific source address, that come in on the interface that connects to the LAN, and that come from one of the specified ports.

If any one of these matches fails (e.g. the source address is not correct, but everything else is true), the whole rule fails and the next rule is tested on the packet. If all matches are true, however, the target specified by the rule is applied.
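
The logical AND between matches, and the fall-through to the next rule, can be sketched as follows. This is a simplified model under stated assumptions: packets are plain dictionaries, every target is treated as terminating, and the function name evaluate and the field names src, in_iface and sport are hypothetical.

```python
def evaluate(packet, rules, policy):
    """Return the target of the first rule whose matches all hold, else the policy."""
    for matches, target in rules:
        # A rule applies only if every single match is true (logical AND).
        if all(match(packet) for match in matches):
            return target
    # No rule matched: the chain policy decides.
    return policy

rules = [
    (
        [
            lambda p: p["src"] == "192.168.0.5",    # specific LAN host
            lambda p: p["in_iface"] == "eth1",      # interface facing the LAN
            lambda p: p["sport"] in (1024, 1025),   # one of the specified ports
        ],
        "ACCEPT",
    ),
]

lan_packet = {"src": "192.168.0.5", "in_iface": "eth1", "sport": 1024}
other_packet = {"src": "192.168.0.9", "in_iface": "eth1", "sport": 1024}

print(evaluate(lan_packet, rules, policy="DROP"))    # all three matches hold
print(evaluate(other_packet, rules, policy="DROP"))  # source address fails
```

The second packet differs only in its source address, yet that one failing match is enough for the rule to be skipped.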

Roughly speaking, matches can be classified into five different subcategories:

First of all we have the generic matches, which can be used in all rules.
Then we have the TCP matches which can only be applied to TCP packets.
We have UDP matches which can only be applied to UDP packets, and
ICMP matches which can only be used on ICMP packets.
Finally we have special matches, such as the state, owner and limit matches and so on.

These final matches have in turn been narrowed down to even more subcategories, even though they might not necessarily be different matches at all.

Generic Matches

A generic match is the kind of match that is always available, whatever kind of protocol we are working on, or whatever match extensions we have loaded. No special parameters at all are needed to use these matches.

-p or --protocol:
- Example: iptables -A INPUT -p tcp
- Explanation: This match is used to check for certain protocols. Examples of protocols are TCP, UDP and ICMP. The protocol may be given as one of the internally known names such as tcp, udp or icmp. It may also take a name specified in the /etc/protocols file, and if it cannot find the protocol there it will reply with an error. The protocol may also be an integer value: ICMP is protocol number 1, TCP is 6 and UDP is 17. Finally, it may also take the value ALL, which matches every protocol. The integer value zero (0) is equivalent to ALL, which in turn is the default behavior if the --protocol match is not used. This match can also be inverted with the ! sign, so --protocol ! tcp would match every protocol except TCP, for example UDP and ICMP.
-s or --src or --source
- Example: iptables -A INPUT -s 192.168.1.1
- Explanation: This is the source match, which is used to match packets based on their source IP address. The main form can be used to match single IP addresses, such as 192.168.1.1. It could also be used with a netmask in a CIDR (Classless Inter-Domain Routing) bit form, by specifying the number of ones (1's) on the left side of the network mask. This means that we could for example add /24 to use a 255.255.255.0 netmask. We could then match whole IP ranges, such as our local networks or network segments behind the firewall. The line would then look something like 192.168.0.0/24. This would match all packets in the 192.168.0.x range. Another way is to do it with a regular netmask in the 255.255.255.255 form, i.e. 192.168.0.0/255.255.255.0. We could also invert the match with an ! just as before. If we were to use a match in the form of --source ! 192.168.0.0/24, we would match all packets with a source address outside the 192.168.0.x range. The default is to match all IP addresses if no particular source IP address or IP address range is specified.
-d or --dst or --destination
- Example: iptables -A INPUT -d 192.168.1.1
- Explanation: The --destination match is used for packets based on their destination address or addresses. It works pretty much the same as the --source match and has the same syntax, except that the match is based on where the packets are going to. To match an IP range, we can add a netmask either in the exact netmask form, or in the number of ones (1's) counted from the left side of the netmask bits. Examples are: 192.168.0.0/255.255.255.0 and 192.168.0.0/24. Both of these are equivalent. We could also invert the whole match with an ! sign, just as before i.e. --destination ! 192.168.0.1 would match all packets except those destined to the 192.168.0.1 IP address.
-i or --in-interface
- Example: iptables -A INPUT -i eth0
- Explanation: This match is used for the interface the packet came in on. Note that this option is only legal in the INPUT, FORWARD and PREROUTING chains and will return an error message when used anywhere else. The default behavior of this match, if no particular interface is specified, is to assume a string value of + (a glob). The + value is used to match a string of letters and numbers. A single + would tell the kernel to match all packets without considering which interface they came in on. The + string can also be appended to the type of interface, so eth+ would be all ethernet devices. We can also invert the meaning of this option with the help of the ! sign. The line would then have a syntax looking something like -i ! eth0, which would match all incoming interfaces, except eth0.
-o or --out-interface
- Example: iptables -A FORWARD -o eth0
- Explanation: The --out-interface match is used for packets on the interface from which they are leaving. Note that this match is only available in the OUTPUT, FORWARD and POSTROUTING chains, the opposite in fact of the --in-interface match. Other than this, it works pretty much the same as the --in-interface match. The + extension is understood as matching all devices of similar type, so eth+ would match all eth devices and so on. To invert the meaning of the match, we can use the ! sign in exactly the same way as for the --in-interface match. If no --out-interface is specified, the default behavior for this match is to match all devices, regardless of where the packet is going.
-f or --fragment
- Example: iptables -A INPUT -f
- Explanation: This match is used to match the second and all later fragments of a fragmented packet. The reason for this is that in the case of fragmented packets, there is no way to tell the source or destination ports of the fragments, nor ICMP types, among other things. Also, fragmented packets might in rather special cases be used to mount attacks against other computers. Packet fragments like this will not be matched by other rules, and hence this match was created. This option can also be used in conjunction with the ! sign. However, in this case the ! sign must precede the match, i.e. ! -f. When this match is inverted, we match all header fragments and/or unfragmented packets. What this means is that we match all the first fragments of fragmented packets, and not the second, third, and so on. We also match all packets that have not been fragmented during transfer. Note also that there are really good defragmentation options within the kernel that we can use instead. As a secondary note, if we use connection tracking we will not see any fragmented packets, since they are dealt with before hitting any chain or table.
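
The address and interface matching rules described above can be tried out with a small sketch. It only illustrates the matching logic with Python's standard ipaddress module; it is not the kernel's code, and the function names source_matches and interface_matches are made up for this example.

```python
import ipaddress

def source_matches(address, spec, invert=False):
    """Model of -s: spec is a single address, a /bits form or a /netmask form."""
    network = ipaddress.ip_network(spec, strict=False)
    hit = ipaddress.ip_address(address) in network
    return hit != invert  # the ! sign flips the result

def interface_matches(name, pattern, invert=False):
    """Model of -i / -o: a trailing + matches any interface with that prefix."""
    if pattern.endswith("+"):
        hit = name.startswith(pattern[:-1])
    else:
        hit = name == pattern
    return hit != invert

# Both netmask notations describe the same network.
print(ipaddress.ip_network("192.168.0.0/24")
      == ipaddress.ip_network("192.168.0.0/255.255.255.0"))

print(source_matches("192.168.0.42", "192.168.0.0/24"))           # inside the range
print(source_matches("10.0.0.1", "192.168.0.0/24", invert=True))  # outside, inverted
print(interface_matches("eth1", "eth+"))                          # any eth device
print(interface_matches("eth0", "eth0", invert=True))             # everything but eth0
```

A bare + has an empty prefix, so in this model it matches every interface name, which mirrors the default behavior described for --in-interface.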

Implicit matches

This section will describe the matches that are loaded implicitly. Implicit matches are implied, taken for granted, automatic.

For example, when we match on --protocol tcp, the TCP matches are loaded without our having to ask for them. There are currently three types of implicit matches for three different protocols. These are

TCP matches
UDP matches
ICMP matches

The TCP based matches contain a set of unique criteria that are available only for TCP packets. UDP based matches contain another set of criteria that are available only for UDP packets. And the same thing for ICMP packets.

On the other hand, there are explicit matches, which are not implied or automatic: we have to load them ourselves. For these we use the -m or --match option, which we will discuss in the next section.

TCP matches

These matches are protocol specific and are only available when working with TCP packets and streams. To use these matches, we need to specify --protocol tcp before trying to use them.

Note that the --protocol tcp match must be to the left of the protocol specific matches. These matches are loaded implicitly in a sense, just as the UDP and ICMP matches are loaded implicitly.

--sport or --source-port
- Example: iptables -A INPUT -p tcp --sport 22
- Explanation: The --source-port match is used to match packets based on their source port. Without it, we imply all source ports. This match can either take a service name or a port number. If we specify a service name, the service name must be in the /etc/services file, since iptables uses this file to map service names to ports. If we specify the port by its number, the rule will load slightly faster, since iptables does not have to look up the service name. However, the match might be a little bit harder to read than if we use the service name. If we are writing a rule-set consisting of 200 rules or more, we should definitely use port numbers, since the difference is really noticeable. (On a slow box, this could make as much as 10 seconds difference, if we have configured a large rule-set containing 1000 rules or so). We can also use the --source-port match to match any range of ports, --source-port 22:80 for example. This example would match all source ports from 22 through 80. If we omit the first port, port 0 is assumed (is implicit) i.e. --source-port :80 would then match port 0 through 80. And if the last port specification is omitted, port 65535 is assumed i.e. if we were to write --source-port 22:, we would have specified a match for all ports from port 22 through port 65535. If we write the range with the high port first, iptables automatically swaps the two ports: --source-port 80:22 is simply interpreted as --source-port 22:80. We can also invert a match by adding a ! sign. For example, --source-port ! 22 means that we want to match all ports but port 22. The inversion could also be used together with a port range and would then look like --source-port ! 22:80, which in turn would mean that we want to match all ports but ports 22 through 80. Note that this match does not handle multiple separated ports and port ranges. For more information about those, look at the multiport match extension.
--dport or --destination-port
- Example: iptables -A INPUT -p tcp --dport 22
- Explanation: This match is used to match TCP packets, according to their destination port. It uses exactly the same syntax as the --source-port match. It understands port and port range specifications, as well as inversions. It also reverses high and low ports in port range specifications, as above. The match will also assume values of 0 and 65535 if the high or low port is left out in a port range specification. In other words, exactly the same as the --source-port syntax. Note that this match does not handle multiple separated ports and port ranges. For more information about those, look at the multiport match extension.
--tcp-flags
- Example: iptables -A INPUT -p tcp --tcp-flags SYN,FIN,ACK SYN
- Explanation: This match is used to match on the TCP flags in a packet. First of all, the match takes a list of TCP flags to compare (a mask) and secondly it takes a list of flags that should be set (i.e. be 1); every other flag in the mask must be unset. Both lists should be comma-delimited. The example above therefore matches packets with the SYN bit set and the FIN and ACK bits unset. The match knows about the SYN, ACK, FIN, RST, URG, PSH flags, and it also recognizes the words ALL and NONE. ALL and NONE are pretty much self-explanatory: ALL means to use all flags and NONE means to use no flags for the option e.g. --tcp-flags ALL NONE would mean to check all of the TCP flags and match if none of the flags are set. This option can also be inverted with the ! sign. For example, if we specify ! SYN,FIN,ACK SYN, we match every packet for which that condition does not hold, for example packets that have the ACK and FIN bits set, but not the SYN bit.
--syn
- Example: iptables -A INPUT -p tcp --syn
- Explanation: The --syn match is more or less an old relic from the ipchains days and is still there for backward compatibility and to make the transition from one to the other easier. It is used to match packets if they have the SYN bit set and the ACK, RST and FIN bits unset. This command would in other words be exactly the same as the --tcp-flags SYN,RST,ACK,FIN SYN match. Such packets are mainly used to request new TCP connections from a server. If we block these packets, we should have effectively blocked all incoming connection attempts. However, we will not have blocked the outgoing connections, which a lot of exploits today use (for example, hacking a legitimate service and then installing a program or suchlike that initiates a connection from our machine to the outside, instead of opening up a new port on it). This match can also be inverted with the ! sign like this ! --syn. This would match all packets that are not connection requests, for example those with the RST or the ACK bit set, in other words packets in an already established connection.
--tcp-option
- Example: iptables -A INPUT -p tcp --tcp-option 16
- Explanation: This match is used to match packets depending on their TCP options. A TCP option is a specific part of the header. This part consists of 3 different fields. The first one is 8 bits long and tells us which option is used (the option kind), the second one is also 8 bits long and tells us how long the option is, and the third one holds the option data itself. The reason for this length field is that TCP options are optional. To be compliant with the standards, we do not need to implement all options, but instead we can just look at what kind of option it is, and if we do not support it, we just look at the length field and can then jump over this data. This match is used to match different TCP options depending on their decimal values. It may also be inverted with the ! flag, so that the match matches all TCP options but the option given to the match.
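
The mask-and-compare logic of --tcp-flags can be sketched in a few lines. This is an illustration of the idea only, not the netfilter source. The bit values are the standard positions of the six flags in the TCP header; the function names to_bits and tcp_flags_match are made up for this example.

```python
# Standard bit positions of the six classic flags in the TCP header.
FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def to_bits(names):
    """Turn 'SYN,FIN,ACK', 'ALL' or 'NONE' into a bit mask."""
    if names == "ALL":
        return 0x3F
    if names == "NONE":
        return 0
    bits = 0
    for name in names.split(","):
        bits |= FLAGS[name]
    return bits

def tcp_flags_match(packet_flags, mask, comp, invert=False):
    """Look only at the flags in mask; exactly those listed in comp must be set."""
    hit = (packet_flags & to_bits(mask)) == to_bits(comp)
    return hit != invert  # the ! sign flips the result

syn = FLAGS["SYN"]
syn_ack = FLAGS["SYN"] | FLAGS["ACK"]
fin_ack = FLAGS["FIN"] | FLAGS["ACK"]

print(tcp_flags_match(syn, "SYN,FIN,ACK", "SYN"))                   # SYN alone
print(tcp_flags_match(syn_ack, "SYN,FIN,ACK", "SYN"))               # ACK is set too
print(tcp_flags_match(0, "ALL", "NONE"))                            # no flags at all
print(tcp_flags_match(fin_ack, "SYN,FIN,ACK", "SYN", invert=True))  # inverted match
```

Flags that are not part of the mask are ignored entirely, so a packet with SYN and PSH set still satisfies --tcp-flags SYN,FIN,ACK SYN in this model.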

UDP matches

This section describes matches that will only work together with UDP packets. These matches are implicitly loaded when we specify the --protocol udp match and will be available after this specification.

Note that UDP packets are not connection oriented, and hence there is no such thing as different flags to set in the packet to give data on what the datagram is supposed to do, such as opening or closing a connection, or if they are just simply supposed to send data.

UDP packets do not require any kind of acknowledgment either. If they are lost, they are simply lost (not taking ICMP error messaging etc. into account). This means that there are far fewer matches to work with on a UDP packet than there are on TCP packets.

Also note that the state machine will work on all kinds of packets even though UDP and ICMP are counted as connectionless protocols. The state machine works pretty much the same on UDP packets as on TCP packets.

--sport or --source-port
- Example: iptables -A INPUT -p udp --sport 53
- Explanation: This match works exactly the same as its TCP counterpart. It is used to perform matches on packets based on their source UDP ports. It has support for port ranges, single ports and port inversions with the same syntax. To specify a UDP port range, we could use 22:80 which would match UDP ports 22 through 80. If the first value is omitted, port 0 is assumed. If the last port is omitted, port 65535 is assumed. If the high port comes before the low port, the ports switch place with each other automatically. Single UDP port matches look as in the example above. To invert the port match, add a ! sign like for example --source-port ! 53. This would match all ports but port 53. The match can understand service names, as long as they are available in the /etc/services file. Note that this match does not handle multiple separated ports and port ranges. For more information about this, look at the multiport match extension.
--dport or --destination-port
- Example: iptables -A INPUT -p udp --dport 53
- Explanation: The same goes for this match as for --source-port above. It is exactly the same as for the equivalent TCP match, but here it applies to UDP packets. It matches packets based on their UDP destination port. The match handles port ranges, single ports and inversions. To match a single port we use, for example, --destination-port 53, to invert this we would use --destination-port ! 53. The first would match all UDP packets going to port 53 while the second would match all packets but those going to the destination port 53. To specify a port range, we would, for example, use --destination-port 9:19. This example would match all packets destined for UDP port 9 through 19. If the first port is omitted, port 0 is assumed. If the second port is omitted, port 65535 is assumed. If the high port is placed before the low port, they automatically switch place, so the low port winds up before the high port. Note that this match does not handle multiple ports and port ranges. For more information about this, look at the multiport match extension.
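
The port range rules shared by the TCP and UDP port matches (an omitted first port means 0, an omitted last port means 65535, a reversed range is swapped) can be sketched like this. It is a minimal model following the description above; it handles numeric ports only, not service names from /etc/services, and the function names parse_port_range and port_matches are made up.

```python
def parse_port_range(spec):
    """Turn '53', '22:80', ':80' or '22:' into a (low, high) pair."""
    if ":" in spec:
        low, high = spec.split(":", 1)
        low = int(low) if low else 0          # omitted first port: 0
        high = int(high) if high else 65535   # omitted last port: 65535
    else:
        low = high = int(spec)                # single port
    if low > high:
        low, high = high, low                 # high port given first: swap, as the text describes
    return low, high

def port_matches(port, spec, invert=False):
    """True if port falls inside the range; the ! sign flips the result."""
    low, high = parse_port_range(spec)
    return (low <= port <= high) != invert

print(parse_port_range(":80"))
print(parse_port_range("22:"))
print(parse_port_range("80:22"))
print(port_matches(53, "53", invert=True))   # --destination-port ! 53 against port 53
print(port_matches(12, "9:19"))
```

Several separate ports or ranges in one rule are outside this model, just as they are outside the plain port matches; that is what the multiport extension is for.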

ICMP matches

These are the ICMP matches. These packets are even more ephemeral, that is to say short lived, than UDP packets, in the sense that they are connectionless.

The ICMP protocol is mainly used for error reporting and for connection controlling and the like. ICMP is not a protocol subordinated to the IP protocol, but more of a protocol that augments the IP protocol and helps in handling errors.

The headers of ICMP packets are very similar to those of the IP headers, but differ in a number of ways. The main feature of this protocol is the type header, that tells us what the packet is for. One example is, if we try to access an inaccessible IP address, we would normally get an ICMP host unreachable in return. There is only one ICMP specific match available for ICMP packets, and hopefully this should suffice.

This match is implicitly loaded when we use the --protocol icmp match and we get access to it automatically. Note that all the generic matches can also be used, so that among other things we can match on the source and destination addresses.

--icmp-type
- Example: iptables -A INPUT -p icmp --icmp-type 8
- Explanation: This match is used to specify the ICMP type to match. ICMP types can be specified either by their numeric values or by their names. Numerical values are specified in RFC 792. To find a complete listing of the ICMP name values, do an iptables --protocol icmp --help. This match can also be inverted with the ! sign e.g. --icmp-type ! 8. Note that some ICMP types are obsolete, and others again may be dangerous for an unprotected host since they may, among other things, redirect packets to the wrong places. The type and code may also be specified by their typename, numeric type, and type/code as well. For example --icmp-type network-redirect, --icmp-type 8 or --icmp-type 8/0.

Please note that netfilter uses ICMP type 255 to match all ICMP types. If we try to match this ICMP type, we will wind up with matching all ICMP types.
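
How a numeric type or type/code specification is compared against a packet can be sketched as follows. This is a simplified model: it accepts numeric values only (not names such as network-redirect), it treats type 255 as "all types" as noted above, and the function name icmp_matches is hypothetical.

```python
def icmp_matches(packet_type, packet_code, spec, invert=False):
    """Model of --icmp-type with a numeric 'type' or 'type/code' spec."""
    if "/" in spec:
        want_type, want_code = (int(part) for part in spec.split("/", 1))
    else:
        want_type, want_code = int(spec), None  # no code given: any code is fine
    if want_type == 255:
        hit = True  # per the note above, type 255 ends up matching all ICMP types
    else:
        hit = packet_type == want_type and (want_code is None or packet_code == want_code)
    return hit != invert  # the ! sign flips the result

print(icmp_matches(8, 0, "8"))      # echo request, any code
print(icmp_matches(8, 0, "8/0"))    # echo request, code 0
print(icmp_matches(3, 1, "3/13"))   # destination unreachable, but a different code
print(icmp_matches(8, 0, "8", invert=True))
```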

SCTP matches

SCTP (Stream Control Transmission Protocol) is a relatively new occurrence in the networking domain in comparison to the TCP and UDP protocols. The implicit SCTP matches are loaded through adding the -p sctp match to the command line of iptables.

The SCTP protocol was developed by some of the larger telecom and switch/network manufacturers out there, and the protocol is especially well suited for large simultaneous transactions with high reliability and high throughput.

--source-port or --sport
- Example: iptables -A INPUT -p sctp --source-port 80
- Explanation: The --source-port match is used to match an SCTP packet based on the source port in the SCTP packet header. The port can either be a single port, as in the example above, or a range of ports specified as --source-port 20:100, or it can also be inverted with the ! sign. This looks, for example, like --source-port ! 25. The source port is an unsigned 16 bit integer, so the maximum value is 65535 and the lowest value is 0.
--destination-port or --dport
- Example: iptables -A INPUT -p sctp --destination-port 80
- Explanation: This match is used for the destination port of the SCTP packets. All SCTP packets contain a destination port, just as they contain a source port, in the headers. The port can be either specified as in the example above, or with a port range such as --destination-port 6660:6670. The command can also be inverted with the ! sign, for example, --destination-port ! 80. This example would match all packets but those to port 80. The same applies for destination ports as for source ports, the highest port is 65535 and the lowest is 0.
--chunk-types
- Example: iptables -A INPUT -p sctp --chunk-types any INIT,INIT_ACK
- Explanation: This matches the chunk type of the SCTP packet. The match begins with the --chunk-types keyword, and then continues with a flag of either all, any or only. After this, we specify the SCTP chunk types to match for. Additionally, the chunk types can take some chunk flags as well. This is done for example in the form --chunk-types any DATA:Be. The flags are specific for each SCTP chunk type and must be valid according to the list below. If an upper case letter is used, the flag must be set, and if a lower case letter is used, the flag must be unset to match. The whole match can be inverted by using an ! sign just after the --chunk-types keyword. For example, --chunk-types ! any DATA:Be would match anything but this pattern.
  - Chunk types: DATA, INIT, INIT_ACK, SACK, HEARTBEAT, HEARTBEAT_ACK, ABORT, SHUTDOWN, SHUTDOWN_ACK, ERROR, COOKIE_ECHO, COOKIE_ACK, ECN_ECNE, ECN_CWR, SHUTDOWN_COMPLETE, ASCONF, ASCONF_ACK.
  - Flags: The following flags can be used with the --chunk-types match as seen above.
    - DATA: U or u for unordered bit, B or b for beginning fragment bit and E or e for ending fragment bit.
    - ABORT: T or t for TCB destroy flag.
    - SHUTDOWN_COMPLETE: T or t for TCB destroyed flag.
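
The upper case / lower case convention for chunk flags can be sketched as follows. This models a single chunk specification such as DATA:Be only; the all, any and only keywords are left out, a chunk's flags are represented as a set of upper case letters, and the function names parse_chunk_spec and chunk_matches are made up for this example.

```python
def parse_chunk_spec(spec):
    """Split e.g. 'DATA:Be' into (chunk type, flags to be set, flags to be unset)."""
    chunk_type, _, letters = spec.partition(":")
    must_set = {c for c in letters if c.isupper()}            # upper case: flag must be set
    must_unset = {c.upper() for c in letters if c.islower()}  # lower case: flag must be unset
    return chunk_type, must_set, must_unset

def chunk_matches(chunk_type, flags_set, spec):
    """flags_set holds the upper case letters of the flags set on the chunk."""
    want_type, must_set, must_unset = parse_chunk_spec(spec)
    return (
        chunk_type == want_type
        and must_set <= flags_set           # every required flag is present
        and not (must_unset & flags_set)    # no forbidden flag is present
    )

print(parse_chunk_spec("DATA:Be"))
print(chunk_matches("DATA", {"B"}, "DATA:Be"))       # beginning fragment, not ending
print(chunk_matches("DATA", {"B", "E"}, "DATA:Be"))  # ending bit is set as well
print(chunk_matches("INIT", set(), "INIT"))          # no flags requested
```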

Explicit matches

Explicit matches are those that have to be specifically loaded with the -m or --match option. State matches, for example, demand the directive -m state prior to entering the actual match that we want to use.

Some of these matches may be protocol specific. Some may be unconnected with any specific protocol — for example connection states. These might be NEW (the first packet of an as yet unestablished connection), ESTABLISHED (a connection that is already registered in the kernel), RELATED (a new connection that was created by an older, established one) etc.

A few may have been developed just for testing or experimental purposes, or just to illustrate what iptables is capable of. This in turn means that not all of these matches may at first sight be of any use. Nevertheless, it may well be that we personally will find a use for specific explicit matches. And there are new ones coming along all the time, with each new iptables release.

Whether we find a use for them or not depends on our imagination and needs. The difference between implicitly loaded matches and explicitly loaded ones, is that the implicitly loaded matches will automatically be loaded when, for example, we match on the properties of TCP packets, while explicitly loaded matches will never be loaded automatically — it is up to us to discover and activate explicit matches.

Addrtype match

The addrtype module matches packets based on the address type. The address type is used inside the kernel to put different packets into different categories.

With this match we will be able to match all packets based on their address type according to the kernel. It should be noted that the exact meaning of the different address types varies between the OSI layer 3 protocols. The available types are as follows:

ANYCAST: This is a one-to-many associative connection type, where only one of the many receiver hosts actually receives the data. This is for example implemented with the DNS (Domain Name System). We have a single address for a root server, but it actually has several locations and our packet will be directed to the closest working server. This one is not implemented in Linux IPv4.
BLACKHOLE: A blackhole address will simply delete the packet and send no reply. It works as a black hole in space basically. This is configured in the routing tables of Linux.
BROADCAST: A broadcast packet is a single packet sent to everyone in a specific network in a one-to-many relation. This is for example used in ARP (Address Resolution Protocol) resolution, where a single packet is sent out requesting information on how to reach a specific IP, and then the host that is authoritative replies with the proper MAC (Media Access Control) address of that host.
LOCAL: An address that is local to the host we are working on e.g. 127.0.0.1.
MULTICAST: A multicast packet is sent to several hosts using the shortest distance, and only one packet is sent to each waypoint, where it is copied for each host/router subscribing to the specific multicast address. Commonly used in one-way streaming media such as video or sound.
NAT: An address that has been NAT'ed by the kernel.
PROHIBIT: Same as blackhole except that a prohibited answer will be generated. In the IPv4 case, this means an ICMP communication prohibited (type 3, code 13) answer will be generated.
THROW: Special route in the Linux kernel. If a packet is thrown in a routing table it will behave as if no route was found in the table. In normal routing, this means that the packet will behave as if it had no route. In policy routing, another route might be found in another routing table.
UNICAST: A real routable address for a single address. The most common type of route.
UNREACHABLE: This signals an unreachable address that we do not know how to reach. The packets will be discarded and an ICMP Host unreachable (type 3, code 1) will be generated.
UNSPEC: An unspecified address that has no real meaning.
XRESOLVE: This address type is used to send route lookups to userspace applications which will do the lookup for the kernel. This might be wanted to send ugly lookups to the outside of the kernel, or to have an application do lookups for us. However, as of now (April 2009) this one is not implemented in Linux.

The addrtype match is loaded by using the -m addrtype keyword. When this is done, the extra match options below will be available for usage:

--src-type
- Example: iptables -A INPUT -m addrtype --src-type UNICAST
- Explanation: The --src-type match option is used to match the source address type of the packet. It can either take a single address type or several separated by commas, for example --src-type BROADCAST,MULTICAST. The match option may also be inverted by adding an exclamation sign before it, for example ! --src-type BROADCAST,MULTICAST.
--dst-type
- Example: iptables -A INPUT -m addrtype --dst-type UNICAST
- Explanation: The --dst-type works exactly the same way as --src-type and has the same syntax. The only difference is that it will match packets based on their destination address type.

AH/ESP match

These matches are used for the IPSEC AH (Authentication Header) and ESP (Encapsulating Security Payload) protocols. IPSEC is used to create secure tunnels over an insecure Internet connection.

The AH and ESP protocols are used by IPSEC to create these secure connections. The AH and ESP matches are really two separate matches, but are both described here since they look very much alike, and both are used in the same function.

To use the AH/ESP matches, we need to use -m ah to load the AH matches, and -m esp to load the ESP matches.

--ahspi
- Example: iptables -A INPUT -p 51 -m ah --ahspi 500
- Explanation: This matches the AH SPI (Security Parameter Index) number of the AH packets. Please note that we must specify the protocol as well, since AH runs on a different protocol than the standard TCP, UDP or ICMP protocols. The SPI number is used in conjunction with the source and destination address and the secret keys to create an SA (Security Association). The SA uniquely identifies each and every one of the IPSEC tunnels to all hosts. The SPI is used to uniquely distinguish each IPSEC tunnel connected between the same two peers. Using the --ahspi match, we can match a packet based on its SPI. This match can match a whole range of SPI values by using a : sign, such as 500:520, which will match the whole range of SPIs.
--espspi
- Example: iptables -A INPUT -p 50 -m esp --espspi 500
- Explanation: The ESP counterpart SPI is used exactly the same way as the AH variant. The match looks exactly the same, with the esp/ah difference. Of course, this match can match a whole range of SPI numbers as well as the AH variant of the SPI match, such as --espspi 200:250 which matches the whole range of SPI's.

Comment match

The comment match is used to add comments inside the iptables ruleset and the kernel. This can make it much easier to understand our ruleset and thus ease debugging a lot.

For example, we could add comments documenting which bash function added specific sets of rules to netfilter, and why. It should be noted that this is not actually a match. The comment match is loaded using the -m comment keywords. At this point the following options are available:

--comment
- Example: iptables -A INPUT -m comment --comment "A comment"
- Explanation: The --comment option specifies the comment to actually add to the rule within the kernel. The comment can be a maximum of 256 characters.

Connmark match

The connmark match is used very much the same way as the mark match is in the MARK target and match combination. The connmark match is used to match marks which have been set on a connection with the CONNMARK target.

To match the mark already on the first packet of a connection, that is, the packet that creates the connection marking, the rule using the connmark match must come after the rule in which the CONNMARK target sets the mark.

--mark
- Example: iptables -A INPUT -m connmark --mark 12 -j ACCEPT
- Explanation: The mark option is used to match a specific mark associated with a connection. The mark match must be exact. If we want to filter out unwanted bits from the connection mark before actually matching anything, we can specify a mask that is ANDed with the connection mark. For example, if we have a connection mark set to 33 (100001 in binary) on a connection, and want to match the lowest bit only, we would be able to run something like --mark 1/1. The mask (000001) is ANDed with 100001, so 100001 & 000001 equals 1, which is then compared against the 1.
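
The masked comparison can be sketched in a few lines of Python. This is only a model of the arithmetic, not netfilter code; the function name mark_matches is made up for illustration, and the default mask of all ones is an assumption standing in for "no mask given".

```python
def mark_matches(conn_mark: int, value: int, mask: int = 0xFFFFFFFF) -> bool:
    """Model of '--mark value/mask': AND the mark with the mask, then compare."""
    return (conn_mark & mask) == value


# A connection mark of 33 has bits 5 and 0 set.
print(bin(33))                  # 0b100001
print(mark_matches(33, 1, 1))   # True: the lowest bit is set
print(mark_matches(33, 1))      # False: without a mask the whole value must be equal
print(mark_matches(32, 1, 1))   # False: the lowest bit is clear
```

With a mask, several unrelated flags can share one connection mark, and each rule looks only at the bits it cares about.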

Conntrack match

The conntrack match is an extended version of the state match, which makes it possible to match packets in a much more granular way. It lets us look at information directly available in the connection tracking system, without any additional layers, such as in the state match.

There are a number of different matches put together in the conntrack match, for several different fields in the connection tracking system. These are compiled together into the list below. To load these matches, we need to specify -m conntrack.

--ctstate
- Example: iptables -A INPUT -p tcp -m conntrack --ctstate RELATED
- Explanation: This match is used to match the state of a packet, according to the conntrack state. It is used to match pretty much the same states as in the original state match. The valid entries for this match are: INVALID, ESTABLISHED, NEW, RELATED, SNAT and DNAT. The entries can be used together with each other separated by a comma. For example, -m conntrack --ctstate ESTABLISHED,RELATED. It can also be inverted by putting a ! in front of --ctstate e.g. -m conntrack ! --ctstate ESTABLISHED,RELATED, which matches all but the ESTABLISHED and RELATED states.
--ctproto
- Example: iptables -A INPUT -p tcp -m conntrack --ctproto TCP
- Explanation: This matches the protocol, the same as --protocol does. It can take the same types of values, and is inverted using the ! sign. For example, -m conntrack ! --ctproto TCP matches all protocols but the TCP protocol.
--ctorigsrc
- Example: iptables -A INPUT -p tcp -m conntrack --ctorigsrc 192.168.0.0/24
- Explanation: --ctorigsrc matches based on the original source IP specification of the conntrack entry that the packet is related to. The match can be inverted with a ! sign. Older iptables releases accept it between the option and the IP specification, such as --ctorigsrc ! 192.168.0.1, while newer releases expect it in front of the option, as in ! --ctorigsrc 192.168.0.1. It can also take a netmask of the CIDR (Classless Inter-Domain Routing) form, such as --ctorigsrc 192.168.0.0/24.
--ctorigdst
- Example: iptables -A INPUT -p tcp -m conntrack --ctorigdst 192.168.0.0/24
- Explanation: This match is used exactly as the --ctorigsrc, except that it matches on the destination field of the conntrack entry. It has the same syntax in all other respects.
--ctreplsrc
- Example: iptables -A INPUT -p tcp -m conntrack --ctreplsrc 192.168.0.0/24
- Explanation: The --ctreplsrc match is used to match based on the reply source of the conntrack entry that the packet belongs to. Basically, this is the same as the --ctorigsrc, but instead we match the source address expected of the reply packets. This match can, of course, be inverted and address a whole range of addresses, just the same as the previous matches in this class.
--ctrepldst
- Example: iptables -A INPUT -p tcp -m conntrack --ctrepldst 192.168.0.0/24
- Explanation: The --ctrepldst match is the same as the --ctreplsrc match, with the exception that it matches the reply destination of the conntrack entry that matched the packet. It too can be inverted, and accept ranges, just as the --ctreplsrc match.
--ctstatus
- Example: iptables -A INPUT -p tcp -m conntrack --ctstatus ASSURED
- Explanation: This matches the status of the connection, as described in The state machine section. It can match the following statuses:
  - NONE, The connection has no status at all.
  - EXPECTED, This connection is expected and was added by one of the expectation handlers.
  - SEEN_REPLY, This connection has seen a reply but isn't assured yet.
  - ASSURED, The connection is assured and will not be removed until it times out or the connection is closed by either end.
- Note: All the statuses can also be inverted by using the ! sign. For example -m conntrack ! --ctstatus ASSURED which will match all but the ASSURED status.
--ctexpire
- Example: iptables -A INPUT -p tcp -m conntrack --ctexpire 100:150
- Explanation: This match is used to match on packets based on how long is left on the expiration timer of the conntrack entry, measured in seconds. It can take either a single value to match against, or a range such as in the example above. It can also be inverted by using the ! sign, such as this -m conntrack ! --ctexpire 100. This will match every entry that does not have exactly 100 seconds left on its expiration timer.

Dscp match

This match is used to match on packets based on their DSCP (Differentiated Services Code Point) field. The DSCP field itself is defined in RFC 2474; RFC 2638 describes a DiffServ architecture built on it. The match is explicitly loaded by specifying -m dscp. The match can take two mutually exclusive options, described below.

--dscp
- Example: iptables -A INPUT -p tcp -m dscp --dscp 32
- Explanation: This option takes a DSCP value in either decimal or in hex. If the option value is in decimal, it would be written like 32 or 16 etc. If written in hex, it should be prefixed with 0x, like this: 0x20. It can also be inverted by using the ! character, like this: -m dscp ! --dscp 32.
--dscp-class
- Example: iptables -A INPUT -p tcp -m dscp --dscp-class BE
- Explanation: The --dscp-class match is used to match on the DiffServ class of a packet. The values can be any of the BE, EF, AFxx or CSx classes as specified in the various RFCs. This match can be inverted just the same way as the --dscp option.

Please note that the --dscp and --dscp-class options are mutually exclusive and can not be used in conjunction with each other.
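
To illustrate how class names relate to the numeric values accepted by --dscp, here is a small Python sketch. It is not how iptables implements the lookup; the function names are hypothetical, and the mapping uses the standard DiffServ codepoint values (BE = 0, EF = 46, CSx = 8x, AFxy = 8x + 2y).

```python
def dscp_class_to_value(name: str) -> int:
    """Translate a DiffServ class name (BE, EF, CSx, AFxy) to its 6-bit DSCP value."""
    name = name.upper()
    if name == "BE":
        return 0
    if name == "EF":
        return 46
    if name.startswith("CS") and name[2:].isdigit() and 0 <= int(name[2:]) <= 7:
        return int(name[2:]) * 8
    if name.startswith("AF") and len(name) == 4 and name[2] in "1234" and name[3] in "123":
        return int(name[2]) * 8 + int(name[3]) * 2
    raise ValueError(f"unknown DSCP class: {name}")


def dscp_from_tos(tos: int) -> int:
    """The DSCP occupies the upper six bits of the former TOS byte."""
    return (tos & 0xFF) >> 2


print(dscp_class_to_value("CS4"))        # 32, the same value as --dscp 32 or --dscp 0x20
print(dscp_class_to_value("AF11"))       # 10
print(hex(dscp_class_to_value("EF")))    # 0x2e
print(dscp_from_tos(0x80))               # 32
```

Because a class name is just another spelling of one numeric codepoint, it makes sense that the two options cannot be combined in a single match.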

Ecn match

The ecn match is used to match on the different ECN (Explicit Congestion Notification) fields in the TCP and IPv4 headers. ECN is described in detail in RFC 3168. The match is explicitly loaded by using -m ecn in the command line. The ecn match takes three different options as described below.

--ecn-tcp-cwr
- Example: iptables -A INPUT -p tcp -m ecn --ecn-tcp-cwr
- Explanation: This match is used to match the CWR (Congestion Window Reduced) bit, if it has been set. The CWR flag is set by an endpoint to notify the other endpoint of the connection that it has received an ECE (ECN-Echo) and has reacted to it. Per default this matches if the CWR bit is set, but the match may also be inverted using an exclamation point.
--ecn-tcp-ece
- Example: iptables -A INPUT -p tcp -m ecn --ecn-tcp-ece
- Explanation: This match can be used to match the ECE (ECN-Echo) bit. The ECE is set once one of the endpoints has received a packet with the CE bit set by a router. The endpoint then sets the ECE in the returning ACK packet, to notify the other endpoint that it needs to slow down. The other endpoint then sends a CWR packet as described in the --ecn-tcp-cwr explanation. This matches per default if the ECE bit is set, but may be inverted by using an exclamation point.
--ecn-ip-ect
- Example: iptables -A INPUT -p tcp -m ecn --ecn-ip-ect 1
- Explanation: The --ecn-ip-ect match is used to match the ECT (ECN Capable Transport) codepoints. The ECT codepoints have several uses. Mainly, they are used to signal that the connection is ECN capable by setting one of the two bits to 1. The same two bits are also used by routers to indicate that they are experiencing congestion, by setting both of them to 1. The ECT values are all available in the ECN Field in IP table below. The match can be inverted using an exclamation point, for example ! --ecn-ip-ect 2 which will match all ECN values but the ECT(0) codepoint. The valid value range is 0-3 in iptables.
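
As a rough illustration of the values --ecn-ip-ect compares against, the following Python sketch extracts the two ECN bits from the IPv4 TOS/traffic class byte. The helper names are invented for this example; the codepoint names follow RFC 3168.

```python
# The two lowest bits of the IPv4 TOS byte carry the ECN field (RFC 3168).
ECN_CODEPOINTS = {0: "Not-ECT", 1: "ECT(1)", 2: "ECT(0)", 3: "CE"}


def ecn_ip_ect(tos: int) -> int:
    """Return the 2-bit ECN field of a TOS byte as a number from 0 to 3."""
    return tos & 0x03


def ecn_ip_ect_matches(tos: int, wanted: int, invert: bool = False) -> bool:
    """Model of '[!] --ecn-ip-ect wanted' (hypothetical helper, not iptables code)."""
    return (ecn_ip_ect(tos) == wanted) != invert


print(ECN_CODEPOINTS[ecn_ip_ect(0x02)])              # ECT(0)
print(ECN_CODEPOINTS[ecn_ip_ect(0x83)])              # CE: both bits set by a congested router
print(ecn_ip_ect_matches(0x02, 2, invert=True))      # False: "! --ecn-ip-ect 2" excludes ECT(0)
```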

Hashlimit match

This is a modified version of the limit match. Instead of just setting up a single token bucket, it sets up a hash table pointing to token buckets for each destination IP, source IP, destination port and source port tuple.

For example, we can set it up so that every IP address can receive a maximum of 1000 packets per second, or we can say that every service on a specific IP address may receive a maximum of 200 packets per second. The hashlimit match is loaded by specifying the -m hashlimit keywords.

Each rule that uses the hashlimit match creates a separate hashtable which in turn has a specific max size and a maximum number of buckets. This hash table contains a hash of either a single or multiple values. The values can be any and/or all of destination IP, source IP, destination port and source port. Each entry then points to a token bucket that works as the limit match.

--hashlimit
- Example: iptables -A INPUT -p tcp --dst 192.168.0.3 -m hashlimit --hashlimit 1000/sec --hashlimit-mode dstip,dstport --hashlimit-name hosts
- Explanation: The --hashlimit (this option is mandatory for all hashlimit matches) specifies the limit of each bucket. In this example the hashlimit is set to 1000/sec. We have set up the hashlimit-mode to be dstip,dstport and destination 192.168.0.3. Hence, for every port or service on the destination host, it can receive 1000 packets per second. This is the same setting as the limit option for the limit match. The limit can take a /sec, /minute, /hour or /day postfix. If no postfix is specified, the default postfix is per second.
--hashlimit-mode
- Example: iptables -A INPUT -p tcp --dst 192.168.0.0/16 -m hashlimit --hashlimit 1000/sec --hashlimit-mode dstip --hashlimit-name hosts
- Explanation: The --hashlimit-mode option (this option is mandatory for all hashlimit matches) specifies which values we should use as the hash values. In this example, we use only the dstip (destination IP) as the hashvalue. So, each host in the 192.168.0.0/16 network will be limited to receiving a maximum of 1000 packets per second in this case. The possible values for the --hashlimit-mode are dstip (Destination IP), srcip (Source IP), dstport (Destination port) and srcport (Source port). Several of these can be combined, separated by commas, to include more than one hashvalue, for example --hashlimit-mode dstip,dstport.
--hashlimit-name
- Example: iptables -A INPUT -p tcp --dst 192.168.0.3 -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts
- Explanation: This option (this option is mandatory for all hashlimit matches) specifies the name that this specific hash will be available as. It can be viewed inside the /proc/net/ipt_hashlimit directory. The example above would be viewable inside the /proc/net/ipt_hashlimit/hosts file. Only the filename should be specified.
--hashlimit-burst
- Example: iptables -A INPUT -p tcp --dst 192.168.0.3 -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-burst 2000
- Explanation: This match is the same as the --limit-burst in that it sets the maximum size of the bucket. Each bucket will have a burst limit, which is the maximum amount of packets that can be matched during a single time unit.
--hashlimit-htable-size
- Example: iptables -A INPUT -p tcp --dst 192.168.0.3 -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-htable-size 500
- Explanation: This sets the maximum available buckets to be used. In this example, it means that a maximum of 500 ports can be open and active at the same time.
--hashlimit-htable-max
- Example: iptables -A INPUT -p tcp --dst 192.168.0.3 -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-htable-max 500
- Explanation: The --hashlimit-htable-max sets the maximum number of hashtable entries. This means all of the connections, including the inactive connections that do not require any token buckets for the moment.
--hashlimit-htable-gcinterval
- Example: iptables -A INPUT -p tcp --dst 192.168.0.3 -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-htable-gcinterval 1000
- Explanation: This sets how often the garbage collection function should be run. Generally speaking this value should be lower than the expire value. The value is measured in milliseconds. If it is set too low it will take up unnecessary system resources and processing power, but if it is too high it can leave unused token buckets lying around for too long, making other connections impossible. In this example the garbage collector will run every second.
--hashlimit-htable-expire
- Example: iptables -A INPUT -p tcp --dst 192.168.0.3 -m hashlimit --hashlimit 1000 --hashlimit-mode dstip,dstport --hashlimit-name hosts --hashlimit-htable-expire 10000
- Explanation: This value sets how long an idle hashtable entry is kept before it expires. If a bucket has been unused for longer than this, it will be expired and the next garbage collection run will remove it from the hashtable, as well as all of the information pertaining to it.
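
The idea of one token bucket per hash key can be modelled in Python as follows. This is a simplified sketch, not the kernel implementation: the class and field names are made up, time is passed in explicitly, and table size limits, expiry and garbage collection are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    tokens: float   # tokens currently available
    last: float     # time of the last refill, in seconds


@dataclass
class HashLimit:
    rate: float     # tokens added per second (models --hashlimit)
    burst: int      # bucket size (models --hashlimit-burst)
    mode: tuple     # packet fields forming the hash key (models --hashlimit-mode)
    buckets: dict = field(default_factory=dict)

    def match(self, packet: dict, now: float) -> bool:
        # One bucket per distinct combination of the fields named in mode.
        key = tuple(packet[name] for name in self.mode)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = Bucket(tokens=self.burst, last=now)
        # Refill according to the elapsed time, never beyond the burst size.
        bucket.tokens = min(self.burst, bucket.tokens + (now - bucket.last) * self.rate)
        bucket.last = now
        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return True
        return False


limiter = HashLimit(rate=2.0, burst=2, mode=("dstip", "dstport"))
web = {"dstip": "192.168.0.3", "dstport": 80}
ssh = {"dstip": "192.168.0.3", "dstport": 22}

print([limiter.match(web, 0.0) for _ in range(3)])  # [True, True, False]
print(limiter.match(ssh, 0.0))                      # True: port 22 has its own bucket
print(limiter.match(web, 0.5))                      # True: one token refilled after 0.5 s
```

Switching mode to ("dstip",) would make both services on the host share a single bucket, which mirrors the difference between the dstip,dstport and dstip examples above.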

Helper match

This is a rather unorthodox match in comparison to the other matches, in the sense that it uses a slightly special syntax.

The match is used to match packets, based on which conntrack helper that the packet is related to. For example, let us look at the FTP session. The control session is opened up, and the ports/connection is negotiated for the data session within the control session. The ip_conntrack_ftp helper module will find this information, and create a related entry in the conntrack table.

Now, when a packet enters, we can see which protocol it was related to, and we can match the packet in our ruleset based on which helper was used. The match is loaded by using the -m helper keyword.

--helper
- Example: iptables -A INPUT -p tcp -m helper --helper ftp-21
- Explanation: The --helper option is used to specify a string value, telling the match which conntrack helper to match. In the basic form, it may look like --helper irc. This is where the syntax starts to change from the normal syntax. We can also choose to only match packets based on the port on which the original expectation was caught. For example, the FTP Control session is normally transferred over port 21, but it may as well be port 954 or any other port. We may then specify on which port the expectation should be caught, like --helper ftp-954.

IP range match

The IP range match is loaded by using the -m iprange keyword. It is used to match IP ranges, just as the --source and --destination matches are able to do as well.

However, this match adds a different kind of matching in the sense that it is able to match in the manner of from IP / to IP, which the --source and --destination matches are unable to. This may be needed in some specific network setups, and it is rather a bit more flexible.

--src-range
- Example: iptables -A INPUT -p tcp -m iprange --src-range 192.168.1.13-192.168.2.19
- Explanation: This matches a range of source IP addresses. The range includes every single IP address from the first to the last, so the example above includes everything from 192.168.1.13 to 192.168.2.19. The match may also be inverted by adding an !. The above example would then look like -m iprange ! --src-range 192.168.1.13-192.168.2.19, which would match every single IP address, except the ones specified.
--dst-range
- Example: iptables -A INPUT -p tcp -m iprange --dst-range 192.168.1.13-192.168.2.19
- Explanation: The --dst-range works exactly the same as the --src-range match, except that it matches destination IP's instead of source IP's.
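
The from/to semantics can be sketched with Python's standard ipaddress module. The function name in_range and its invert parameter are invented for this illustration; the range string uses the same first-last form as the examples above.

```python
import ipaddress


def in_range(addr: str, ip_range: str, invert: bool = False) -> bool:
    """Model of '[!] --src-range first-last': an inclusive from/to comparison."""
    first, last = (ipaddress.ip_address(part) for part in ip_range.split("-"))
    hit = first <= ipaddress.ip_address(addr) <= last
    return hit != invert


RANGE = "192.168.1.13-192.168.2.19"
print(in_range("192.168.1.13", RANGE))               # True: the first address is included
print(in_range("192.168.1.200", RANGE))              # True
print(in_range("192.168.2.20", RANGE))               # False: one past the last address
print(in_range("192.168.2.20", RANGE, invert=True))  # True: the inverted form
```

A range like this does not line up with a single network/netmask pair, which is the case the iprange match is meant for.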

Length match

The length match is used to match packets based on their length. It is very simple. If we want to limit the packet length for some strange reason, or want to block ping-of-death-like behavior, we use the length match.

--length
- Example: iptables -A INPUT -p tcp -m length --length 1400:1500
- Explanation: The example --length will match all packets with a length between 1400 and 1500 bytes. The match may also be inverted using the ! sign, like this: -m length ! --length 1400:1500. It may also be used to match only a specific length, removing the : sign and onwards, like this: -m length --length 1400. The range matching is, of course, inclusive, which means that it includes the two boundary values as well as all packet lengths in between.
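
The min:max notation used here (and by the SPI and --ctexpire matches earlier) boils down to an inclusive range test. A minimal Python sketch, with a hypothetical function name and handling only the two forms shown in the text:

```python
def length_matches(packet_len: int, spec: str, invert: bool = False) -> bool:
    """Model of '[!] --length spec', where spec is 'N' or 'MIN:MAX' (inclusive)."""
    if ":" in spec:
        low, high = spec.split(":")
        hit = int(low) <= packet_len <= int(high)
    else:
        hit = packet_len == int(spec)
    return hit != invert


print(length_matches(1400, "1400:1500"))               # True: lower bound is included
print(length_matches(1500, "1400:1500"))               # True: upper bound is included
print(length_matches(1501, "1400:1500"))               # False
print(length_matches(1401, "1400"))                    # False: a single value must be equal
print(length_matches(1501, "1400:1500", invert=True))  # True
```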

Limit match

The limit match extension must be loaded explicitly with the -m limit option. This match can, for example, be used to advantage to give limited logging of specific rules etc.

For example, we could use this to match all packets that do not exceed a given value, and after this value has been exceeded, limit logging of the event in question.

Think of a time limit: We could limit how many times a certain rule may be matched in a certain time frame, for example to lessen the effects of DoS syn flood attacks. This is its main usage, but there are more usages, of course. The limit match may also be inverted by adding a ! flag in front of the limit match. It would then be expressed as -m limit ! --limit 5/s. This means that all packets will be matched after they have broken the limit.

To further explain the limit match, it is basically a token bucket filter. Consider having a leaky bucket where the bucket leaks X packets per time-unit. X is defined depending on how many matching packets we get, so if we get 3 packets, the bucket leaks 3 packets per that time-unit.

The --limit option tells us how many packets to refill the bucket with per time-unit, while the --limit-burst option tells us how big the bucket is in the first place. So, setting --limit 3/minute --limit-burst 5 and then receiving 5 matches will empty the bucket. After 20 seconds, the bucket is refilled with another token, and so on until the --limit-burst is reached again or until they get used.

Consider the example below for further explanation of how this may look.

We set a rule with -m limit --limit 5/second --limit-burst 10. The limit-burst token bucket is set to 10 initially. Each packet that matches the rule uses a token.
We get 10 packets that match, 1-2-3-4-5-6-7-8-9-10, all within 1/1000 of a second.
The token bucket is now empty. Once the token bucket is empty, packets that would otherwise qualify for the rule no longer match it and proceed to the next rule, if any, or hit the chain policy e.g. DROP.
For each 1/5 second without a matching packet, the token count goes up by 1, up to a maximum of 10. 1 second after receiving the 10 packets, we will once again have 5 tokens left.
And of course, the bucket will be emptied by 1 token for each packet it receives.
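
The walk-through above can be replayed with a small token bucket model in Python. This is a sketch under simplifying assumptions, not the kernel code: the class name is made up, time is passed in explicitly, and tokens are refilled continuously in proportion to the elapsed time.

```python
class TokenBucket:
    """Toy model of '-m limit --limit RATE/second --limit-burst BURST'."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = float(burst)   # the bucket starts out full
        self.last = 0.0              # time of the last refill, in seconds

    def match(self, now: float) -> bool:
        # Refill for the time that has passed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1         # every matching packet uses one token
            return True
        return False                 # empty bucket: the rule does not match


bucket = TokenBucket(rate_per_second=5, burst=10)

first_burst = [bucket.match(0.0) for _ in range(11)]
print(sum(first_burst))   # 10: ten packets match, the eleventh does not

later = [bucket.match(1.0) for _ in range(6)]
print(sum(later))         # 5: one second later five tokens have been refilled
```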

--limit
- Example: iptables -A INPUT -m limit --limit 3/hour
- Explanation: This sets the maximum average match rate for the limit match. We specify it with a number and an optional time unit. The following time units are currently recognized: /second, /minute, /hour, and /day. The default value here is 3 per hour, or 3/hour. This tells the limit match how many times to allow the match to occur per time unit e.g. per minute.
--limit-burst
- Example: iptables -A INPUT -m limit --limit-burst 5
- Explanation: This is the setting for the burst limit of the limit match. It tells iptables the maximum number of tokens available in the bucket when we start, or when the bucket is full. This number gets decremented by one for every packet that arrives, down to the lowest possible value, 1. The bucket will be refilled by the limit value every time unit, as specified by the --limit option. The default --limit-burst value is 5.

Mac match

The MAC (Ethernet Media Access Control) match can be used to match packets based on their MAC source address. It can match on the source MAC address only. We explicitly load it with the -m mac option.

--mac-source
- Example: iptables -A INPUT -m mac --mac-source 00:00:00:00:00:01
- Explanation: This match is used to match packets based on their MAC source address. The MAC address specified must be in the form XX:XX:XX:XX:XX:XX, else it will not be legal. The match may be reversed with an ! sign and would look like --mac-source ! 00:00:00:00:00:01 (newer iptables releases expect the ! in front of the option, i.e. ! --mac-source 00:00:00:00:00:01). This would in other words reverse the meaning of the match, so that all packets except packets from this MAC address would be matched. Note that since MAC addresses are only used on Ethernet type networks, this match will only be possible to use for Ethernet interfaces. The MAC match is only valid in the PREROUTING, FORWARD and INPUT chains.

Mark match

The mark match extension is used to match packets based on the marks they have set. A mark is a special field, only maintained within the kernel, that is associated with the packets as they travel through the computer.

Marks may be used by different kernel routines for such tasks as traffic shaping and filtering. As of today, there is only one way of setting a mark in Linux, namely the MARK target in iptables. This was previously done with the FWMARK target in ipchains, and this is why people still refer to FWMARK in advanced routing areas.

The mark field is currently set to an unsigned integer, or 4294967296 possible values on a 32 bit system. In other words, we are probably not going to run into this limit for quite some time.

--mark
- Example: iptables -t mangle -A INPUT -m mark --mark 1
- Explanation: This match is used to match packets that have previously been marked. Marks can be set with the MARK target. All packets traveling through netfilter get a special mark field associated with them. Note that this mark field is not in any way propagated, within or outside the packet. It stays inside the computer that made it. If the mark field matches the mark, it is a match. The mark field is an unsigned integer, hence there can be a maximum of 4294967296 different marks. We may also use a mask with the mark. The mark specification would then look like, for example, --mark 1/1. If a mask is specified, it is logically ANDed with the mark field of the packet before the actual comparison.

Multiport match

The multiport match extension can be used to specify multiple destination ports and port ranges. Without the possibility this match gives, we would have to use multiple rules of the same type, just to match different ports.

We cannot use both standard port matching and multiport matching at the same time, for example we cannot write: --sport 1024:63353 -m multiport --dport 21,23,80. This will simply not work. What in fact happens, if we do, is that iptables honors the first element in the rule, and ignores the multiport instruction.

--source-port
- Example: iptables -A INPUT -p tcp -m multiport --source-port 22,53,80,110
- Explanation: This match matches multiple source ports. A maximum of 15 separate ports may be specified. The ports must be comma delimited, as in the above example. The match may only be used in conjunction with the -p tcp or -p udp matches. It is mainly an enhanced version of the normal --source-port match.
--destination-port
- Example: iptables -A INPUT -p tcp -m multiport --destination-port 22,53,80,110
- Explanation: This match is used to match multiple destination ports. It works exactly the same way as the above mentioned source port match, except that it matches destination ports. It too has a limit of 15 ports and may only be used in conjunction with -p tcp and -p udp.
--port
- Example: iptables -A INPUT -p tcp -m multiport --port 22,53,80,110
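- Explanation: This match extension can be used to match packets based both on their destination port and their source port. It works the same way as the --source-port and --destination-port matches above. It can take a maximum of 15 ports and can only be used in conjunction with -p tcp and -p udp. Note that in the iptables releases this description is based on, the --port match will only match packets coming in from and going to the same port, for example, port 80 to port 80, port 110 to port 110 and so on. Newer releases document the corresponding --ports option as matching if either the source or the destination port is in the list, so check the manual page of the installed version.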

Owner match

The owner match extension is used to match packets based on the identity of the process that created them.

The owner can be specified as the ID of the user who issued the command in question, of the group, of the process or of the session, or as the name of the command itself.

This extension was originally written as an example of what iptables could be used for. The owner match only works within the OUTPUT chain, for obvious reasons — it is pretty much impossible to find out any information about the identity of the instance that sent a packet from the other end, or where there is an intermediate hop to the real destination.

Even within the OUTPUT chain it is not very reliable, since certain packets may not have an owner. Notorious packets of that sort are (among other things) the different ICMP responses. ICMP responses will never match.

--cmd-owner
- Example: iptables -A OUTPUT -m owner --cmd-owner httpd
- Explanation: This is the command owner match, and is used to match based on the command name of the process that is sending the packet. In the example, httpd is matched. This match may also be inverted by using an exclamation sign, for example -m owner ! --cmd-owner ssh.
--uid-owner
- Example: iptables -A OUTPUT -m owner --uid-owner 500
- Explanation: This packet match will match if the packet was created by the given UID (User ID). This could be used to match outgoing packets based on who created them. One possible use would be to block any other user than root from opening new connections outside our packet filter. Another possible use could be to block everyone but the http user from sending packets from the HTTP port.
--gid-owner
- Example: iptables -A OUTPUT -m owner --gid-owner 0
- Explanation: This match is used to match all packets based on their GID (Group ID). This means that we match all packets based on what group the user creating the packets is in. This could be used to block all but the users in the network group from getting out onto the Internet or, as described above, only to allow members of the http group to create packets going out from the HTTP port.
--pid-owner
- Example: iptables -A OUTPUT -m owner --pid-owner 78
- Explanation: This match is used to match packets based on the PID (Process Identifier) that was responsible for them. This match is a bit harder to use, but one example would be only to allow PID 94 to send packets from the HTTP port (if the HTTP process is not threaded, of course). Alternatively we could write a small script that grabs the PID from a ps output for a specific daemon and then add a rule for it.
--sid-owner
- Example: iptables -A OUTPUT -m owner --sid-owner 100
- Explanation: This match is used to match packets based on the SID (Session ID) used by the program in question. The value of the SID, or SID of a process, is that of the process itself and all processes resulting from the originating process. These latter could be threads, or a child of the original process. So, for example, all of our HTTPD processes should have the same SID as their parent process (the originating HTTPD process), if our HTTPD is threaded (most HTTPDs are, Apache and Roxen for instance).

The pid, sid and command matching is broken in SMP kernels since they use different process lists for each processor. It might be fixed in the future, however.

Packet type match

The packet type match is used to match packets based on their type, i.e. whether they are destined to a single specific host, to everyone or to a specific group of machines or users. These three groups are generally called unicast, broadcast and multicast. The match is loaded by using -m pkttype.

--pkt-type
- Example: iptables -A OUTPUT -m pkttype --pkt-type unicast
- Explanation: The --pkt-type match is used to tell the packet type match which packet type to match. It can take either unicast, broadcast or multicast as an argument, as in the example. It can also be inverted by using a ! like this: -m pkttype --pkt-type ! broadcast (newer iptables releases expect the ! in front of the option, i.e. -m pkttype ! --pkt-type broadcast), which will match all other packet types.

Realm match

The realm match is used to match packets based on the routing realm that they are part of.

Routing realms are used in Linux for complex routing scenarios and setups such as when using BGP (Border Gateway Protocol) etc. The realm match is loaded by adding the -m realm keyword to the commandline.

A routing realm is used in Linux to classify routes into logical groups of routes. In most dedicated routers today, the RIB (Routing Information Base) and the forwarding engine are very close to each other, inside the kernel for example. Since Linux is not really a dedicated routing system, it has been forced to separate its RIB and FIB (Forwarding Information Base).

The RIB lives in userspace and the FIB lives inside kernelspace. Because of this separation, it becomes quite resource-heavy to do quick searches in the RIB. The routing realm is the Linux solution to this, and actually makes the system more flexible and richer.

The Linux realms can be used together with BGP and other routing protocols that deliver huge amounts of routes. The routing daemon can then sort the routes by their prefix, aspath, or source for example, and put them in different realms. The realm is numeric, but can also be named through the /etc/iproute2/rt_realms file.

--realm
- Example: iptables -A OUTPUT -m realm --realm 4
- Explanation: This option matches the realm number and optionally a mask. If the value is not a number, iptables will try to resolve the realm name through the /etc/iproute2/rt_realms file. If a named realm is used, no mask may be used. The match may also be inverted by setting an exclamation sign, for example --realm ! cosmos (newer iptables releases expect the ! in front of the option, i.e. ! --realm cosmos).

Recent match

The recent match is a rather large and complex matching system, which allows us to match packets based on recent events that we have previously matched.

For example, if we would see an outgoing IRC connection, we could set the IP addresses into a list of hosts, and have another rule that allows identd requests back from the IRC server within 15 seconds of seeing the original packet.

Before we can take a closer look at the match options, let us try and explain a little bit how it works:

First of all, we use several different rules to accomplish the use of the recent match. The recent match uses several different lists of recent events. The default list being used is the DEFAULT list. We create a new entry in a list with the set option, so once a rule is entirely matched (the set option is always a match), we also add an entry in the recent list specified.

The list entry contains a timestamp, and the source IP address used in the packet that triggered the set option. Once this has happened, we can use a series of different recent options to match on this information, as well as update the entry's timestamp etc.

Finally, if we would for some reason want to remove a list entry, we would do this using the --remove match option from the recent match.
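
The list handling described above can be pictured with a small Python model. It is only a sketch: the class and method names are invented (they mirror the set, rcheck, update and remove options of the recent match), time is passed in explicitly, and only a single timestamp per source address is kept.

```python
class RecentList:
    """Toy model of one named recent list: source IP -> last seen timestamp."""

    def __init__(self):
        self.last_seen = {}

    def set(self, src: str, now: float) -> bool:
        # Adds or refreshes the entry; in this model it always matches.
        self.last_seen[src] = now
        return True

    def rcheck(self, src: str, now: float, seconds=None) -> bool:
        # True if the address is listed and, optionally, was seen recently enough.
        if src not in self.last_seen:
            return False
        if seconds is not None and now - self.last_seen[src] > seconds:
            return False
        return True

    def update(self, src: str, now: float, seconds=None) -> bool:
        # Like rcheck, but a hit also refreshes the timestamp.
        hit = self.rcheck(src, now, seconds)
        if hit:
            self.last_seen[src] = now
        return hit

    def remove(self, src: str) -> bool:
        # True if the address was listed; the entry is dropped.
        return self.last_seen.pop(src, None) is not None


irc = RecentList()
irc.set("10.0.0.5", now=0)
print(irc.rcheck("10.0.0.5", now=10, seconds=15))   # True: seen 10 seconds ago
print(irc.rcheck("10.0.0.5", now=20, seconds=15))   # False: the entry is too old
print(irc.update("10.0.0.5", now=12, seconds=15))   # True, and the timestamp becomes 12
print(irc.rcheck("10.0.0.5", now=20, seconds=15))   # True: now only 8 seconds old
print(irc.remove("10.0.0.5"))                       # True
print(irc.rcheck("10.0.0.5", now=21))               # False: no longer in the list
```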

All rules using the recent match, must load the recent module (-m recent) as usual. Before we go on with an example of the recent match, let's take a look at all the options:

--name
- Example: iptables -A OUTPUT -m recent --name examplelist
- Explanation: The name option gives the name of the list to use. Per default the DEFAULT list is used, which is probably not what we want if we are using more than one list.
--set
- Example: iptables -A OUTPUT -m recent --set
- Explanation: This creates a new list entry in the named recent list, which contains a timestamp and the source IP address of the host that triggered the rule. This match will always return success, unless it is preceded by a ! sign, in which case it will return failure.
--rcheck
- Example: iptables -A OUTPUT -m recent --name examplelist --rcheck
- Explanation: The --rcheck option will check if the source IP address of the packet is in the named list. If it is, the match will return true, otherwise it returns false. The option may be inverted by using the ! sign. In the latter case, it will return true if the source IP address is not in the list, and false if it is in the list.
--update
- Example: iptables -A OUTPUT -m recent --name examplelist --update
- Explanation: This match is true if the source combination is available in the specified list and it also updates the last-seen time in the list. This match may also be reversed by setting the ! mark in front of the match. For example, ! --update.
--remove
- Example: iptables -A INPUT -m recent --name example --remove
- Explanation: This match will try to find the source address of the packet in the list, and returns true if the address is there. It will also remove the corresponding list entry from the list. This match can also be inverted with the ! sign.
--seconds
- Example: iptables -A INPUT -m recent --name example --rcheck --seconds 60
- Explanation: This match is only valid together with the --rcheck and --update matches. The --seconds match specifies how recently the last-seen column of the entry in the recent list must have been updated. If the last-seen value is older than this number of seconds, the match returns false. Other than this the recent match works as normal, so the source address must still be in the list for the match to return true.
--hitcount
- Example: iptables -A INPUT -m recent --name example --rcheck --hitcount 20
- Explanation: The --hitcount match must be used together with the --rcheck or --update matches and it limits the match to source addresses from which at least the hitcount number of packets have been seen. If this match is used together with the --seconds match, it requires the specified number of packets to be seen within the specified timeframe. This match may also be reversed by adding a ! sign in front of the match. Together with the --seconds match, this means that a maximum of this number of packets may have been seen during the specified timeframe. If both of the matches are inverted, then a maximum of this number of packets may have been seen during the last minimum of seconds.
--rttl
- Example: iptables -A INPUT -m recent --name example --rcheck --rttl
- Explanation: The --rttl match is used to verify that the TTL value of the current packet is the same as the original packet that was used to set the original entry in the recent list. This can be used to verify that people are not spoofing their source address to deny others access to our servers by making use of the recent match.
--rsource
- Example: iptables -A INPUT -m recent --name example --rsource
- Explanation: The --rsource match is used to tell the recent match to save the source address in the recent list. This is the default behavior of the recent match.
--rdest
- Example: iptables -A INPUT -m recent --name example --rdest
- Explanation: The --rdest match is the opposite of the --rsource match in that it tells the recent match to save the destination address to the recent list.

Below is a small sample script which demonstrates how the recent match can be used:

#!/bin/bash

iptables -N http-recent
iptables -N http-recent-final
iptables -N http-recent-final1
iptables -N http-recent-final2

iptables -A INPUT -p tcp --dport 80 -j http-recent

# http-recent-final, has this connection been deleted from httplist or not?
#
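# The recent match needs one of --set, --rcheck, --update or --remove;
# --rcheck only tests whether the source address is in the list.
iptables -A http-recent-final -p tcp -m recent --name httplist --rcheck -j http-recent-final1
iptables -A http-recent-final -p tcp -m recent --name http-recent-final --rcheck -j http-recent-final2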


# http-recent-final1, this chain deletes the connection from the httplist
# and adds a new entry to the http-recent-final
#
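# No target on this rule: it only removes the entry, then the packet falls
# through to the next rule, which adds it to the http-recent-final list.
iptables -A http-recent-final1 -p tcp -m recent --name httplist --tcp-flags SYN,ACK,FIN FIN,ACK --remove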
iptables -A http-recent-final1 -p tcp -m recent --name http-recent-final --tcp-flags SYN,ACK,FIN FIN,ACK --set -j ACCEPT


# http-recent-final2, this chain allows final traffic from non-closed host
# and listens for the final FIN and FIN,ACK handshake.
#
iptables -A http-recent-final2 -p tcp --tcp-flags SYN,ACK NONE -m recent --name http-recent-final --update -j ACCEPT
iptables -A http-recent-final2 -p tcp --tcp-flags SYN,ACK ACK -m recent --name http-recent-final --update -j ACCEPT
iptables -A http-recent-final2 -p tcp -m recent --name http-recent-final --tcp-flags SYN,ACK,FIN FIN --update -j ACCEPT
iptables -A http-recent-final2 -p tcp -m recent --name http-recent-final --tcp-flags SYN,ACK,FIN FIN,ACK --remove -j ACCEPT


# http-recent chain, our homebrew state tracking system.
#
# Initial stage of the tcp connection SYN/ACK handshake
iptables -A http-recent -p tcp --tcp-flags SYN,ACK,FIN,RST SYN -m recent --name httplist --set -j ACCEPT
iptables -A http-recent -p tcp --tcp-flags SYN,ACK,FIN,RST SYN,ACK -m recent --name httplist --update -j ACCEPT
# Note that at this state in a connection, RST packets are legal (see RFC 793).
iptables -A http-recent -p tcp --tcp-flags SYN,ACK,FIN ACK -m recent --name httplist --update -j ACCEPT

# Middle stage of tcp connection where data transportation takes place.
iptables -A http-recent -p tcp --tcp-flags SYN,ACK NONE -m recent --name httplist --update -j ACCEPT
iptables -A http-recent -p tcp --tcp-flags SYN,ACK ACK -m recent --name httplist --update -j ACCEPT

# Final stage of tcp connection where one of the parties tries to close the
# connection.
iptables -A http-recent -p tcp --tcp-flags SYN,FIN,ACK FIN -m recent --name httplist --update -j ACCEPT
iptables -A http-recent -p tcp --tcp-flags SYN,FIN,ACK FIN,ACK -m recent --name httplist --rcheck -j http-recent-final

# Special case if the connection crashes for some reason. Malicious intent or
# no.
iptables -A http-recent -p tcp --tcp-flags SYN,FIN,ACK,RST RST -m recent --name httplist --remove -j ACCEPT

Briefly, this is a poor replacement for the state engine available in netfilter. This version was created with an HTTP server in mind, but will work with any TCP connection. First we have created four chains named http-recent, http-recent-final, http-recent-final1 and http-recent-final2.

The http-recent chain is used in the starting stages of the connection, and for the actual data transmission, while the http-recent-final chain is used for the last and final FIN/ACK, FIN handshake.

This is a very bad replacement for the built in state engine and can not handle all of the possibilities that the state engine can handle.

However, it is a good example of what can be done with the recent match without being too specific. We should not use this example in a real world environment. It is slow, handles special cases badly, and should generally never be used more than as an example.

For example, it does not handle closed ports on connection, asynchronous FIN handshake (where one of the connected parties closes down, while the other continues to send data), etc.

Let us follow a packet through the example ruleset. First a packet enters the INPUT chain, and we send it to the http-recent chain.

The first packet should be a SYN packet, and should not have the ACK,FIN or RST bits set. Hence it is matched using the --tcp-flags SYN,ACK,FIN,RST SYN line. At this point we add the connection to the httplist using -m recent --name httplist --set line. Finally we accept the packet.
After the first packet we should receive a SYN/ACK packet to acknowledge that the SYN packet was received. This can be matched using the --tcp-flags SYN,ACK,FIN,RST SYN,ACK line. FIN and RST should be illegal at this point as well. At this point we update the entry in the httplist using -m recent --name httplist --update and finally we ACCEPT the packet.
By now we should get a final ACK packet, from the original creator of the connection, to acknowledge the SYN/ACK sent by the server. SYN and FIN are illegal at this point of the connection, so the line looks like --tcp-flags SYN,ACK,FIN ACK. RST is left out of the mask because, as the comment in the script notes, RST packets are legal at this stage. We update the list in exactly the same way as in the previous step, and ACCEPT it.
At this point the data transmission can start. The connection should never contain any SYN packet now, but it will contain ACK packets to acknowledge the data packets that are sent. Each time we see any packet like this, we update the list and ACCEPT the packets.
The transmission can be ended in two ways. The simplest is the RST packet: RST simply resets the connection and it dies. The other is the FIN handshake: one endpoint sends a FIN/ACK, which closes its side of the connection so that the original source of the FIN/ACK can no longer send any data, and the other endpoint later answers with a FIN of its own. The receiver of the first FIN is still able to send data, hence we send the connection to a final-stage chain to handle the rest.
In the http-recent-final chain we check if the packet is still in the httplist, and if so, we send it to the http-recent-final1 chain. In that chain we remove the connection from the httplist and add it to the http-recent-final list instead. If the connection has already been removed and moved over to the http-recent-final list, we send the packet to the http-recent-final2 chain.
In the final http-recent-final2 chain, we wait for the non-closed side to finish sending its data, and to close the connection from their side as well. Once this is done, the connection is completely removed.

As we can see, the recent list can become quite complex, but it will give us a huge set of possibilities if need be. Still, we should try and remember not to reinvent the wheel. If the ability we need is already implemented, we should try to use it instead of creating our own solution.
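
A more common use of the recent match than hand-rolled state tracking is throttling repeated connection attempts. The following is a minimal sketch, assuming an SSH service on port 22, a list called sshlist, and a limit of 4 new connections per 60 seconds; all three are illustrative choices and not values taken from this page. By default the script only prints the commands; running it as root with IPT=iptables would load them.

```bash
#!/bin/bash
# Dry run by default: print the commands instead of executing them.
# Run with IPT=iptables (as root) to actually load the rules.
IPT="${IPT:-echo iptables}"

ssh_rate_limit_rules() {
    # Drop the packet if this source address has been seen at least 4 times
    # within the last 60 seconds. A matching --update also refreshes the
    # entry's last-seen time, so a persistent client stays blocked.
    $IPT -A INPUT -p tcp --dport 22 -m state --state NEW \
        -m recent --name sshlist --update --seconds 60 --hitcount 4 -j DROP
    # Otherwise record the source address in the list and accept the packet.
    $IPT -A INPUT -p tcp --dport 22 -m state --state NEW \
        -m recent --name sshlist --set -j ACCEPT
}

ssh_rate_limit_rules
```

The order matters: the --update rule comes first so that an address over the limit is dropped before the --set rule gets a chance to accept it.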

State match

The state match extension is used in conjunction with the connection tracking code in the kernel.

The state match accesses the connection tracking state of the packets from the state machine. This allows us to know in what state the connection is, and works for pretty much all protocols, including stateless protocols such as UDP and ICMP.

In all cases, there will be a default timeout for the connection and it will then be dropped from the connection tracking database. This match needs to be loaded explicitly by adding a -m state statement to the rule. We will then have access to one new match called state. The concept of state matching is covered more fully in the state machine section.

--state
- Example: iptables -A INPUT -m state --state RELATED,ESTABLISHED
- Explanation: This match option tells the state match what states the packets must be in to be matched. There are currently 4 states that can be used. INVALID, ESTABLISHED, NEW and RELATED. INVALID means that the packet is associated with no known stream or connection and that it may contain faulty data or headers. ESTABLISHED means that the packet is part of an already established connection that has seen packets in both directions and is fully valid. NEW means that the packet has or will start a new connection, or that it is associated with a connection that has not seen packets in both directions. Finally, RELATED means that the packet is starting a new connection and is associated with an already established connection. This could for example mean an FTP data transfer, or an ICMP error associated with a TCP or UDP connection. Note that the NEW state does not look for SYN bits in TCP packets trying to start a new connection and should, hence, not be used unmodified in cases where we have only one packet filter and no load balancing between different packet filters. However, there may be times where this could be useful.
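
The four states, and the caveat that NEW does not check for the SYN bit, can be put together in a small ruleset. This is a minimal sketch, assuming a single packet filter protecting a web server on port 80 (the port and the rule order are illustrative assumptions). It prints the commands by default; set IPT=iptables and run as root to apply them.

```bash
#!/bin/bash
# Dry run by default: print the commands instead of executing them.
IPT="${IPT:-echo iptables}"

stateful_input_rules() {
    # Packets that belong to, or are related to, a known connection pass.
    $IPT -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    # Packets that connection tracking cannot associate with anything.
    $IPT -A INPUT -m state --state INVALID -j DROP
    # NEW does not imply SYN: drop TCP packets that are NEW but not SYN.
    $IPT -A INPUT -p tcp ! --syn -m state --state NEW -j DROP
    # Allow new connections to the web server.
    $IPT -A INPUT -p tcp --dport 80 -m state --state NEW -j ACCEPT
}

stateful_input_rules
```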

Tcpmss match

The tcpmss match is used to match a packet based on the MSS (Maximum Segment Size) in TCP. This match is only valid for SYN and SYN/ACK packets. This match is loaded using -m tcpmss and takes only one option.

--mss
- Example: iptables -A INPUT -p tcp --tcp-flags SYN,ACK,RST SYN -m tcpmss --mss 2000:2500
- Explanation: The --mss option tells the tcpmss match which MSS to match. This can either be a single specific MSS value, or a range of MSS values separated by a :. The value may also be inverted as usual using the ! sign, as in the following example: -m tcpmss ! --mss 2000:2500. This example will match all MSS values, except for values in the range 2000 through 2500.

Tos match

The TOS (Type of Service) match can be used to match packets based on their TOS field, which consists of 8 bits and is located in the IP header.

This match is loaded explicitly by adding -m tos to the rule. TOS is normally used to inform intermediate hosts of any specific requirements for the stream, such as it having to be sent as fast as possible, or it needing to carry as much payload as possible. Strictly speaking, this is not the same thing as the precedence of the stream.

How different routers and administrators deal with these values varies. Most do not care at all, while others try their best to do something good with the packets in question and the data they provide.

--tos
- Example: iptables -A INPUT -p tcp -m tos --tos 0x10
- Explanation: This match is used as described above. It can match packets based on their TOS field and its value. This could be used, among other things, together with iproute2 and the advanced routing functions in Linux, to mark packets for later usage. The match takes a hex or numeric value as an option, or possibly one of the names listed by iptables -m tos -h. At the time of writing it contained the following named values: Minimize-Delay 16 (0x10), Maximize-Throughput 8 (0x08), Maximize-Reliability 4 (0x04), Minimize-Cost 2 (0x02), and Normal-Service 0 (0x00). Minimize-Delay means to minimize the delay in putting the packets through; examples of standard services that would require this include telnet, SSH (Secure Shell) and FTP-control. Maximize-Throughput means to find a path that allows as big a throughput as possible; a standard protocol would be FTP-data. Maximize-Reliability means to maximize the reliability of the connection and to use lines that are as reliable as possible; a couple of typical examples are BOOTP and TFTP (Trivial File Transfer Protocol). Minimize-Cost means minimizing the cost of packets getting through each link to the client or server, e.g. by finding the route that costs the least to travel along. Examples of normal protocols that would use this would be RTSP (Real Time Streaming Protocol) and other streaming video/radio protocols. Finally, Normal-Service would mean any normal protocol that has no special needs.
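
The named values listed above can be kept in a small lookup helper so that rules read by name rather than by number. This is only a sketch: tos_value is a hypothetical helper function, not part of iptables, and the rule is printed rather than executed unless IPT=iptables is set.

```bash
#!/bin/bash
# tos_value: hypothetical helper mapping a TOS name from the list above
# to its hexadecimal value.
tos_value() {
    case "$1" in
        Minimize-Delay)       echo 0x10 ;;
        Maximize-Throughput)  echo 0x08 ;;
        Maximize-Reliability) echo 0x04 ;;
        Minimize-Cost)        echo 0x02 ;;
        Normal-Service)       echo 0x00 ;;
        *) echo "unknown TOS name: $1" >&2; return 1 ;;
    esac
}

# Dry run by default: print the command instead of executing it.
IPT="${IPT:-echo iptables}"
$IPT -A INPUT -p tcp -m tos --tos "$(tos_value Minimize-Delay)"
```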

Ttl match

The TTL (Time to Live) match is used to match packets based on their TTL field residing in the IP headers.

The TTL field contains 8 bits of data and is decremented once every time it is processed by an intermediate machine between the client and recipient host.

If the TTL reaches 0, an ICMP type 11 code 0 (TTL equals 0 during transit) or code 1 (TTL equals 0 during reassembly) is transmitted to the party sending the packet and informing it of the problem. This match is only used to match packets based on their TTL, and not to change anything. The latter, incidentally, applies to all kinds of matches. To load this match, we need to add an -m ttl to the rule.

--ttl-eq
- Example: iptables -A OUTPUT -m ttl --ttl-eq 60
- Explanation: This match option is used to specify the TTL value to match exactly. It takes a numeric value and matches this value within the packet. There is no inversion and there are no other specifics to match. It could, for example, be used for debugging our LAN e.g. LAN hosts that seem to have problems connecting to hosts on the Internet, or to find possible ingress by Trojans etc. The usage is relatively limited, however, its usefulness really depends on our imagination. One example would be to find hosts with bad default TTL values (could be due to a badly implemented TCP/IP stack, or simply to misconfiguration).
--ttl-gt
- Example: iptables -A OUTPUT -m ttl --ttl-gt 64
- Explanation: This match option is used to match any TTL greater than the specified value. The value can be between 0 and 255 and the match cannot be inverted. It could, for example, be used for matching any TTL greater than a specific value and then forcing it to a standardized value. This could be used to overcome some simple forms of spying by ISPs trying to find out whether we are running multiple machines behind a packet filter, against their policies.
--ttl-lt
- Example: iptables -A OUTPUT -m ttl --ttl-lt 64
- Explanation: The --ttl-lt match is used to match any TTL smaller than the specified value. It is pretty much the same as the --ttl-gt match, but as already stated, it matches smaller TTLs. It could also be used in the same way as the --ttl-gt match, or to simply homogenize the packets leaving our network in general.

Unclean match

The unclean match takes no options and requires no more than explicitly loading it when we want to use it.

The unclean match tries to match packets that seem malformed or unusual, such as packets with bad headers or checksums and so on. This could be used to DROP connections and to check for bad streams, for example. However, we should be aware that it could possibly break legal connections. It is regarded as experimental and may not work at all times, nor will it take care of all unclean packets or problems.

Targets and Jumps

The target/jump tells the rule what to do with a packet that is a perfect match with the match section of the rule.

There are a couple of basic targets, the ACCEPT and DROP targets, which we will deal with first. However, before we do that, let us have a brief look at how a jump is done.

Jump

The jump specification is done in exactly the same way as in the target definition, except that it requires a chain within the same table to jump to. To jump to a specific chain, it is of course a prerequisite that chain exists.

As we have already explained, a user-specified chain is created with the -N command. For example, let us say we create a chain in the filter table called tcp_packets, like this: iptables -N tcp_packets. We could then add a jump target to it like this: iptables -A INPUT -p tcp -j tcp_packets.

We would then jump from the INPUT chain to the tcp_packets chain and start traversing that chain. When/if we reach the end of that chain, we get dropped back to the INPUT chain and the packet starts traversing from the rule one step below where it jumped to the other chain (tcp_packets in this case).

If a packet is ACCEPTed within one of the sub-chains, it will be ACCEPTed in the superior chain also and it will not traverse any of the superior chains any further.

However, do note that the packet will traverse all other chains in the other tables in a normal fashion.
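
The jump behaviour described above can be sketched with the tcp_packets chain from the text. The rules placed inside the chain and after the jump (port 22, the final DROP) are illustrative assumptions. The script prints the commands by default; set IPT=iptables and run as root to apply them.

```bash
#!/bin/bash
# Dry run by default: print the commands instead of executing them.
IPT="${IPT:-echo iptables}"

jump_example_rules() {
    # The chain must exist before any rule can jump to it.
    $IPT -N tcp_packets
    # A packet ACCEPTed here stops traversing tcp_packets and INPUT.
    $IPT -A tcp_packets -p tcp --dport 22 -j ACCEPT
    # Send every TCP packet from INPUT into the user-defined chain.
    $IPT -A INPUT -p tcp -j tcp_packets
    # TCP packets that reach the end of tcp_packets unmatched return to
    # INPUT and continue with the rule after the jump, i.e. this one.
    $IPT -A INPUT -p tcp -j DROP
}

jump_example_rules
```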

Target

Targets on the other hand specify an action to take on the packet in question. We could for example, DROP or ACCEPT the packet depending on what we want to do.

There are also a number of other actions we may want to take, which we will describe further on in this section. Jumping to targets may incur different results, as it were. Some targets will cause the packet to stop traversing that specific chain and superior chains as described above.

Good examples of such rules are DROP and ACCEPT. Packets that are stopped, will not pass through any of the rules further on in the chain or in superior chains. Other targets, may take an action on the packet, after which the packet will continue passing through the rest of the rules.

Good examples of this would be the LOG, ULOG and TOS targets. These targets can log the packets or mangle them, and then pass them on to the other rules in the same set of chains. We might, for example, want this so that we can in addition mangle both the TTL (Time to Live) and the TOS (Type of Service) values of a specific packet/stream.

Some targets will accept extra options (what TOS value to use etc), while others do not necessarily need any options — but we can include them if we want to (log prefixes, masquerade-to ports and so on).

ACCEPT target

This target needs no further options. As soon as the match specification for a packet has been fully satisfied, and we specify ACCEPT as the target, the packet is accepted and will not continue traversing the current chain or any other ones in the same table.

Note however, that a packet that was accepted in one chain might still travel through chains within other tables, and could still be dropped there. There is nothing special about this target whatsoever, and it does not require, nor have the possibility of, adding options to the target. To use this target, we simply specify -j ACCEPT.

CLASSIFY target

The CLASSIFY target can be used to classify packets in such a way that the classification can be used by a couple of different qdiscs (Queue Disciplines), for example the atm, cbq, dsmark, pfifo_fast, htb and prio qdiscs. The CLASSIFY target is only valid in the POSTROUTING chain of the mangle table.

For more information about qdiscs and traffic controlling, please visit the Linux Advanced Routing and Traffic Control HOW-TO webpage.

--set-class
- Example: iptables -t mangle -A POSTROUTING -p tcp --dport 80 -j CLASSIFY --set-class 20:10
- Explanation: The CLASSIFY target only takes one argument, the --set-class. This tells the target how to class the packet. The class takes 2 values separated by a colon, like this: MAJOR:MINOR.

CLUSTERIP target

The CLUSTERIP target is used to create simple clusters of nodes answering to the same IP and MAC address in a round robin fashion.

This is a simple form of clustering where we set up a virtual IP on all hosts participating in the cluster, and then use the CLUSTERIP on each machine that is supposed to answer the requests.

The CLUSTERIP target requires no special load balancing hardware or machines; it simply does its work on each machine that is part of the cluster. It is a very simple clustering solution and not suited for large and complex clusters. Neither does it have built-in heartbeat handling, but that should be easy to implement as a simple script.

All servers in the cluster use a common Multicast MAC for a virtual IP, and then a special hash algorithm is used within the CLUSTERIP target to figure out which of the cluster participants should respond to each connection.

A Multicast MAC is a MAC address starting with 01:00:5e as the first 24 bits — an example of a Multicast MAC would be 01:00:5e:00:00:20. The virtual IP can be any IP address, but must be the same on all hosts as well.

Remember that CLUSTERIP might break protocols such as SSH. The connection will go through properly, but if we connect again later to the same address, we might end up on another machine in the cluster, with a different keyset, and hence our SSH client might refuse to connect or give us errors.

For this reason, this will not work very well with some protocols, and it might be a good idea to add separate addresses that can be used for maintenance and administration. Another solution is to use the same SSH keys on all hosts participating in the cluster (which I think is a bad idea however). The cluster can be load-balanced with three kinds of hashmodes.

The first one is only source IP (sourceip),
the second is source IP and source port (sourceip-sourceport), and
the third one is source IP, source port and destination port (sourceip-sourceport-destport).

The first one might be a good idea where we need to remember state between connections, for example a web server with a shopping cart that keeps state between connections. This load-balancing might become a little bit uneven (some machines might get a higher load than others) since connections from the same source IP will always go to the same server.

The sourceip-sourceport hash might be a good idea where we want to get the load-balancing a little bit more even, and where state does not have to be kept between connections on each server.

For example, a large informational webpage with perhaps a simple search engine might be a good idea here. The third and last hashmode, sourceip-sourceport-destport, might be a good idea where we have a machine with several services running that does not require any state to be preserved between connections.

This might for example be a simple NTP, DNS and WWW server on the same host. Each connection to each new destination would hence be renegotiated — actually no negotiation goes on, it is basically just a round robin system and each machine receives one connection each.

Each CLUSTERIP cluster gets a separate file in the /proc/net/ipt_CLUSTERIP directory, based on the virtual IP of the cluster. If the VIP is 192.168.0.5 for example, we could cat /proc/net/ipt_CLUSTERIP/192.168.0.5 to see which nodes this machine is answering for.

To make the machine answer for another machine, let's say node 2, we add it using echo "+2" >> /proc/net/ipt_CLUSTERIP/192.168.0.5. To remove it, we run echo "-2" >> /proc/net/ipt_CLUSTERIP/192.168.0.5.

--new
- Example: iptables -A INPUT -p tcp -d 192.168.0.5 --dport 80 -j CLUSTERIP --new...
- Explanation: This creates a new CLUSTERIP entry. It must be set on the first rule for a virtual IP, and is used to create a new cluster. If we have several rules connecting to the same CLUSTERIP we can omit the --new keyword in any secondary references to the same virtual IP.
--hashmode
- Example: iptables -A INPUT -p tcp -d 192.168.0.5 --dport 443 -j CLUSTERIP --new --hashmode sourceip...
- Explanation: The --hashmode keyword specifies the kind of hash that should be created. The hashmode can be any of the following three: sourceip, sourceip-sourceport and sourceip-sourceport-destport. Basically, sourceip will give better performance and simpler states between connections, but not as good load-balancing between the machines. sourceip-sourceport will give a slightly slower hashing and not as good to maintain states between connections, but will give better load-balancing properties. The last one may create very slow hashing that consumes a lot of memory, but will on the other hand also create very good load-balancing properties.
--clustermac
- Example: iptables -A INPUT -p tcp -d 192.168.0.5 --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:00:5e:00:00:20...
- Explanation: The MAC address that the cluster is listening to for new connections. This is a shared multicast MAC address that all the hosts are listening to.
--total-nodes
- Example: iptables -A INPUT -p tcp -d 192.168.0.5 --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:00:5e:00:00:20 --total-nodes 2...
- Explanation: The --total-nodes keyword specifies how many hosts are participating in the cluster and that will answer to requests.
--local-node
- Example: iptables -A INPUT -p tcp -d 192.168.0.5 --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
- Explanation: This is the number that this machine has in the cluster. The cluster answers in a round-robin fashion, so once a new connection is made to the cluster, the next machine answers, and then the next after that, and so on.
--hash-init
- Example: iptables -A INPUT -p tcp -d 192.168.0.5 --dport 80 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:00:5e:00:00:20 --hash-init 1234
- Explanation: Specifies a random seed for hash initialization.

CONNMARK target

The CONNMARK target is used to set a mark on a whole connection, much the same way as the MARK target does. It can then be used together with the connmark match to match the connection in the future.

For example, say we see a specific pattern in a header, and we do not want to mark just that packet, but the whole connection. The CONNMARK target is a perfect solution in that case.

The CONNMARK target is available in all chains and all tables, but remember that the nat table is only traversed by the first packet in a connection, so the CONNMARK target will have no effect if we try to use it for subsequent packets after the first one in here.

--set-mark
- Example: iptables -t nat -A PREROUTING -p tcp --dport 80 -j CONNMARK --set-mark 4
- Explanation: This option sets a mark on the connection. The mark can be an unsigned long int, which means that values between 0 and 4294967295 are valid. Each bit can also be masked by doing --set-mark 12/8. This will only allow the bits in the mask to be set out of all the bits in the mark. In this example, only the 4th bit will be set, not the 3rd. 12 translates to 1100 in binary, and 8 to 1000, and only the bits set in the mask are allowed to be set. Hence, only the 4th bit, or 8, is set in the actual mark.
--save-mark
- Example: iptables -t mangle -A PREROUTING -p tcp --dport 80 -j CONNMARK --save-mark
- Explanation: The --save-mark target option is used to save the packet mark into the connection mark. For example, if we have set a packet mark with the MARK target, we can then move this mark to mark the whole connection with the --save-mark option. The mark can also be masked by using the --mask option described further down.
--restore-mark
- Example: iptables -t mangle -A PREROUTING -p tcp --dport 80 -j CONNMARK --restore-mark
- Explanation: This target option restores the packet mark from the connection mark as defined by the CONNMARK. A mask can also be defined using the --mask option as seen below. If a mask is set, only the masked bits will be set. Note that this target option is only valid for use in the mangle table.
--mask
- Example: iptables -t mangle -A PREROUTING -p tcp --dport 80 -j CONNMARK --restore-mark --mask 12
- Explanation: The --mask option must be used in unison with the --save-mark and --restore-mark options. The --mask option specifies an and-mask that should be applied to the mark values that the other two options will give. For example, if the restored mark from the above example would be 15, it would mean that the mark was 1111 in binary, while the mask is 1100. 1111 and 1100 equals 1100.
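
The and-mask arithmetic from the --mask explanation can be checked directly in the shell. This sketch only reproduces the bitwise calculation from the text (mark 15, mask 12); to_bin4 is a hypothetical helper for printing four-bit binary strings and nothing here touches iptables or the kernel.

```bash
#!/bin/bash
# to_bin4: hypothetical helper printing the low four bits of a number.
to_bin4() {
    local n=$1 i bits=""
    for ((i = 3; i >= 0; i--)); do
        bits+=$(( (n >> i) & 1 ))
    done
    echo "$bits"
}

mark=15   # connection mark from the example above
mask=12   # value given to --mask
echo "$(to_bin4 "$mark") and $(to_bin4 "$mask") = $(to_bin4 $(( mark & mask )))"
# prints: 1111 and 1100 = 1100
```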

CONNSECMARK target

The CONNSECMARK target sets a SELinux security context mark to or from a packet mark. The target is only valid in the mangle table and is used together with the SECMARK target, where the SECMARK target is used to set the original mark, and then the CONNSECMARK is used to set the mark on the whole connection.

SELinux is beyond the scope of this page, but basically it is an addition of MAC (Mandatory Access Control) to Linux. This is more fine-grained than the traditional security controls of most Linux and Unix systems.

Each object can have security attributes, or a security context, connected to it, and these attributes are then matched to each other before allowing or denying a specific task to be performed. This target will allow a security context to be set on a connection.

--save
- Example: iptables -t mangle -A PREROUTING -p tcp --dport 80 -j CONNSECMARK --save
- Explanation: Save the security context mark from the packet to the connection if the connection is not already marked.
--restore
- Example: iptables -t mangle -A PREROUTING -p tcp --dport 80 -j CONNSECMARK --restore
- Explanation: If the packet has no security context mark set on it, the --restore option will set the security context mark associated with the connection on the packet.

DNAT target

The DNAT (Destination Network Address Translation) target is used to do DNAT, which means that it is used to rewrite the Destination IP address of a packet.

If a packet is matched, and this is the target of the rule, the packet, and all subsequent packets in the same stream will be translated, and then routed on to the correct device, machine or network.

This target can be extremely useful, for example, when we have a machine running our web server inside a LAN, but no real IP to give it that will work on the Internet. We could then tell the packet filter to forward all packets going to its own HTTP port on to the real web server within the LAN.

We may also specify a whole range of destination IP addresses, and the DNAT mechanism will choose the destination IP address at random for each stream. Hence, we will be able to deal with a kind of load balancing by doing this.

Note that the DNAT target is only available within the PREROUTING and OUTPUT chains in the nat table, and any of the chains called upon from any of those listed chains. Note that chains containing DNAT targets may not be used from any other chains, such as the POSTROUTING chain.

--to-destination
- Example: iptables -t nat -A PREROUTING -p tcp -d 15.45.23.67 --dport 80 -j DNAT --to-destination 192.168.1.1-192.168.1.10
- Explanation: The --to-destination option tells the DNAT mechanism which destination IP to set in the IP header, and where to send packets that are matched. The above example would send on all packets destined for IP address 15.45.23.67 to a range of LAN IPs, namely 192.168.1.1 through 192.168.1.10. Note, as described previously, that a single stream will always use the same machine, and that each stream will randomly be given an IP address that it will always be destined for, within that stream. We could also have specified only one IP address, in which case we would always be connected to the same machine. Also note that we may add a port or port range to which the traffic would be redirected. This is done by adding, for example, an :80 statement to the IP addresses to which we want to DNAT the packets. A rule could then look like --to-destination 192.168.1.1:80 for example, or like --to-destination 192.168.1.1:80-100 if we wanted to specify a port range. As we can see, the syntax is pretty much the same for the DNAT target as for the SNAT target, even though they do two totally different things. Do note that port specifications are only valid for rules that specify the TCP or UDP protocols with the --protocol option.

Since DNAT requires quite a lot of work to work properly, I have decided to add a larger explanation on how to work with it. Let us take a brief example on how things would be done normally. We want to publish our website via our Internet connection. We only have one IP address, and the HTTP server is located on our internal network.

Our packet filter has the external IP address $INET_IP, and our HTTP server has the internal IP address $HTTP_IP and finally the packet filter has the internal IP address $LAN_IP. The first thing to do is to add the following simple rule to the PREROUTING chain in the nat table: iptables -t nat -A PREROUTING --dst $INET_IP -p tcp --dport 80 -j DNAT --to-destination $HTTP_IP.

Now, all packets from the Internet going to port 80 on our packet filter are redirected (or DNAT'ed) to our internal HTTP server. If we test this from the Internet, everything should work perfectly.

So, what happens if we try connecting from a machine on the same local network as the HTTP server? It will simply not work. This is a problem with routing really. We start out by dissecting what happens in a normal case. The external box has IP address $EXT_BOX, to maintain readability.

The IP packet leaves the connecting machine going to $INET_IP and source $EXT_BOX.
The IP packet reaches the packet filter.
Packet Filter DNAT's the packet and runs the packet through all different chains etc.
Packet leaves the packet filter and travels to the $HTTP_IP.
Packet reaches the HTTP server, and the HTTP box replies back through the packet filter, if that is the box that the routing database has entered as the gateway for $EXT_BOX. Normally, this would be the default gateway of the HTTP server.
Packet Filter Un-DNAT's the packet again, so the packet looks as if it was replied to from the packet filter itself.
Reply packet travels as usual back to the client $EXT_BOX.

Now, we will consider what happens if the packet was instead generated by a client on the same network as the HTTP server itself. The client has the IP address $LAN_BOX, while the rest of the machines maintain the same settings.

The IP packet leaves $LAN_BOX to $INET_IP.
The packet reaches the packet filter.
The packet gets DNAT'ed, and all other required actions are taken, however, the packet is not SNAT'ed, so the same source IP address is used on the packet.
The packet leaves the packet filter and reaches the HTTP server.
The HTTP server tries to respond to the packet, and sees in the routing databases that the packet came from a local box on the same network, and hence tries to send the packet directly to the original source IP address (which now becomes the destination IP address).
The packet reaches the client, and the client gets confused since the return packet does not come from the machine that it sent the original request to. Hence, the client drops the reply packet, and waits for the real reply.

The simple solution to this problem is to SNAT all packets entering the packet filter and leaving for a machine or IP that we know we DNAT.

For example, consider the above rule. We SNAT the packets entering our packet filter that are destined for $HTTP_IP port 80 so that they look as if they came from $LAN_IP. This will force the HTTP server to send the packets back to our packet filter, which Un-DNAT's the packets and sends them on to the client. The rule would look something like this: iptables -t nat -A POSTROUTING -p tcp --dst $HTTP_IP --dport 80 -j SNAT --to-source $LAN_IP.

Remember that the POSTROUTING chain is processed last of the chains, and hence the packet will already be DNAT'ed once it reaches that specific chain. This is the reason that we match the packets based on the internal address.

This last rule will seriously harm our logging, so it is really advisable not to use this method, but the whole example is still a valid one.

What will happen is this: the IP packet comes from the Internet, gets SNAT'ed and DNAT'ed, and finally hits the HTTP server (for example). The HTTP server now only sees the request as if it was coming from the packet filter, and hence logs all requests from the internet as if they came from the packet filter.

This can also have even more severe implications. Take an SMTP (Simple Mail Transfer Protocol) server on the LAN that allows requests from the internal network, and assume we have our packet filter set up to forward SMTP traffic to it. We have now effectively created an open relay SMTP server, with horrendously bad logging.

One solution to this problem is to make the SNAT rule more specific in its match part, so that it only works on packets that come in from our LAN i.e. add a --src $LAN_IP_RANGE to the command as well. The rule will then only apply to streams that originate on the LAN and leave the source IP of streams from the Internet untouched, so the logs will look correct, except for streams coming from our LAN.

We will be better off solving these problems by either setting up a separate DNS server for our LAN, or by actually setting up a separate DMZ (Demilitarized Zone), the latter being preferred if we have the money.

There is one final aspect to this whole scenario. What if the packet filter itself tries to access the HTTP server, where will it go? As it looks now, it will unfortunately try to get to its own HTTP server, and not the server residing on $HTTP_IP. To get around this, we need to add a DNAT rule in the OUTPUT chain as well. Following the above example, this should look something like the following: iptables -t nat -A OUTPUT --dst $INET_IP -p tcp --dport 80 -j DNAT --to-destination $HTTP_IP.

Adding this final rule should get everything up and running. All separate networks that do not sit on the same net as the HTTP server will run smoothly, all machines on the same network as the HTTP server will be able to connect and finally, the packet filter will be able to do proper connections as well. Now everything works and no problems should arise.

Everyone should realize that these rules only affect how the packet is DNAT'ed and SNAT'ed properly. In addition to these rules, we may also need extra rules in the filter table (FORWARD chain) to allow the packets to traverse through those chains as well.

Do not forget that all packets have already gone through the PREROUTING chain, and should hence have their destination addresses rewritten already by DNAT.
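
The rules discussed in this section can be collected into one small script. What follows is a sketch under assumptions, not a drop-in rule set: the addresses are placeholders chosen for illustration, the SNAT rule uses the more specific --src $LAN_IP_RANGE variant, and the two FORWARD rules are only one possible way to let the traffic through the filter table. By default the script only prints the commands (dry run); setting IPT=iptables as root would apply them.

```shell
#!/bin/sh
# Dry run by default: print the commands instead of executing them.
IPT="${IPT:-echo iptables}"

# Placeholder addresses (assumptions, adjust to the real network)
INET_IP="203.0.113.10"         # external address of the packet filter
LAN_IP="192.168.1.1"           # internal address of the packet filter
LAN_IP_RANGE="192.168.1.0/24"  # the LAN itself
HTTP_IP="192.168.1.80"         # internal HTTP server

dnat_rules() {
    # Requests arriving for the external address go to the internal server
    $IPT -t nat -A PREROUTING --dst "$INET_IP" -p tcp --dport 80 -j DNAT --to-destination "$HTTP_IP"
    # LAN clients only: make the server reply via the packet filter
    $IPT -t nat -A POSTROUTING --src "$LAN_IP_RANGE" --dst "$HTTP_IP" -p tcp --dport 80 -j SNAT --to-source "$LAN_IP"
    # Connections opened by the packet filter itself
    $IPT -t nat -A OUTPUT --dst "$INET_IP" -p tcp --dport 80 -j DNAT --to-destination "$HTTP_IP"
    # Let the already DNAT'ed packets and their replies traverse FORWARD
    $IPT -A FORWARD --dst "$HTTP_IP" -p tcp --dport 80 -j ACCEPT
    $IPT -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
}

dnat_rules
```

Note that the FORWARD rule matches on $HTTP_IP rather than $INET_IP, because the destination address has already been rewritten in PREROUTING by the time the packet reaches the filter table.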

DROP target

The DROP target does just what it says, it drops packets dead and will not carry out any further processing.

A packet that matches a rule perfectly and is then dropped will be blocked. Note that this action might in certain cases have an unwanted effect, since it could leave dead sockets around on either machine.

A better solution in cases where this is likely would be to use the REJECT target, especially when we want to block port scanners from getting too much information, such as on filtered ports and so on.

Also note that if a packet has the DROP action taken on it in a subchain, the packet will not be processed in any of the main chains either, neither in the present table nor in any other table i.e. the packet simply vanishes.

As we have seen previously, the target will not send any kind of information in either direction, nor to intermediaries such as routers.

DSCP target

This is a target (there is also the DSCP match) that changes the DSCP (Differentiated Services Code Point) marks inside a packet.

The DSCP target is able to set any DSCP value in the IP header of a packet, which is a way of telling routers the priority of the packet in question. For more information about DSCP, look at RFC 2474.

Basically, DSCP is a way of differentiating services into separate categories and, based on this, giving them different priorities through the routers. This way, we can give interactive TCP sessions (such as XMPP, telnet, SSH, POP3) a very fast connection that may not be very suitable for large bulk transfers.

If on the other hand the connection is one of low importance (SMTP, or whatever we classify as low priority), we could send it over a large bulky network with worse latency than the other network, that is cheaper to utilize than the faster and lower latency connections.

--set-dscp
- Example: iptables -t mangle -A FORWARD -p tcp --dport 80 -j DSCP --set-dscp 1
- Explanation: This sets the DSCP value to the specified value. The values can be set either via class (see below) or with the --set-dscp, which takes either an integer value, or a hex value.
--set-dscp-class
- Example: iptables -t mangle -A FORWARD -p tcp --dport 80 -j DSCP --set-dscp-class EF
- Explanation: This sets the DSCP field according to a predefined DiffServ class. Some of the possible values are EF, BE and the available CSxx and AFxx values. Do note that the --set-dscp-class and --set-dscp options are mutually exclusive, which means we cannot use both of them in the same command!
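
To relate the integer or hex value given to --set-dscp to what shows up on the wire, it helps to remember that the DSCP value occupies the upper six bits of the former TOS byte in the IP header. The following sketch computes that byte for a given DSCP value, assuming the two lower (ECN) bits are zero; the function name is made up for illustration.

```shell
# Shift the six DSCP bits into the upper part of the former TOS byte.
dscp_to_tos_byte() {
    # $1 = DSCP value, decimal or hex (0-63)
    printf '0x%02x\n' $(( $1 << 2 ))
}

dscp_to_tos_byte 46     # EF (Expedited Forwarding) is DSCP 46, prints 0xb8
dscp_to_tos_byte 0x2e   # the same value given in hex, prints 0xb8
```

This is handy when reading packet dumps, which often show the whole byte rather than the DSCP value.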

ECN target

Simply put, the ECN (Explicit Congestion Notification) target can be used to reset the ECN bits in the TCP header of IPv4 packets to 0 (there is also the ECN match).

Since ECN is a relatively new thing on the net, there are problems with it. For example, it uses 2 bits that are defined in the original RFC for the TCP protocol to be 0. Some routers and other Internet appliances will not forward packets that have these bits set to 1.

If we want to make use of at least parts of the ECN functionality from our machines, we could for example reset the ECN bits to 0 for specific networks that we know we are having troubles reaching because of ECN.

Please do note that it is not possible to turn ECN on in the middle of a stream. It is not allowed according to the RFCs, and it is not possible anyway. Both endpoints of the stream must negotiate ECN. If we turn it on, then one of the machines is not aware of it, and cannot respond properly to the ECN notifications.

--ecn-tcp-remove
- Example: iptables -t mangle -A FORWARD -p tcp --dport 80 -j ECN --ecn-tcp-remove
- Explanation: The ECN target only takes one argument, the --ecn-tcp-remove argument. This tells the target to remove the ECN bits inside the TCP headers.

LOG target options

The LOG target is specially designed for logging detailed information about packets.

These could, for example, be packets that we consider illegal. Or, logging can be used purely for bug hunting and error finding. The LOG target will return specific information on packets, such as most of the IP headers and other information considered interesting. It does this via the kernel logging facility, normally syslogd.

This information may then be read directly with dmesg, or from the syslogd logs, or with other programs or applications. This is an excellent target to use to debug our rule-sets, so that we can see what packets go where and what rules are applied on what packets.

Note as well that it could be a really great idea to use the LOG target instead of the DROP target while we are testing a rule we are not 100% sure about on a production packet filter, since a syntax error in the rule-sets could otherwise cause severe connectivity problems for our users.

Also note that the ULOG target may be interesting if we are using really extensive logging, since the ULOG target has support for direct logging to MySQL databases and suchlike.

Note that if we get undesired logging direct to consoles, this is not an iptables or netfilter problem, but rather a problem caused by our syslogd configuration in /etc/syslog.conf.

We may also need to tweak our dmesg settings. dmesg is the command that controls which kernel messages are shown on the console. dmesg -n 1 should prevent all messages from showing up on the console, except panic messages. The dmesg message levels match the syslogd levels exactly, and dmesg only works on log messages from the kernel facility.

The LOG target currently takes five options that could be of interest if we have specific information needs, or want to set different options to specific values:

--log-level
- Example: iptables -A FORWARD -p tcp -j LOG --log-level debug
- Explanation: This is the option to tell netfilter and syslog which log level to use. For a complete list of log levels take a look at man 5 syslog.conf. Normally there are the following log levels, or priorities as they are normally referred to: debug, info, notice, warning, warn, err, error, crit, alert, emerg and panic. The keyword error is the same as err, warn is the same as warning and panic is the same as emerg. Note that all three of these are deprecated i.e. we should not use error, warn and panic. The priority defines the severity of the message being logged. All messages are logged through the kernel facility i.e. setting kern.=info /var/log/iptables in /etc/syslog.conf and then letting all our LOG messages in iptables use log level info would make all messages appear in the /var/log/iptables file. Note that there may be other messages here as well, from other parts of the kernel that use the info priority.
--log-prefix
- Example: iptables -A INPUT -p tcp -j LOG --log-prefix "INPUT packets"
- Explanation: This option tells iptables to prefix all log messages with a specific prefix, which can then easily be combined with grep or other tools to track specific problems and output from different rules. The prefix may be up to 29 letters long, including white-spaces and other special symbols.
--log-tcp-sequence
- Example: iptables -A INPUT -p tcp -j LOG --log-tcp-sequence
- Explanation: This option will log the TCP Sequence numbers, together with the log message. The TCP Sequence numbers are special numbers that identify each packet and where it fits into a TCP sequence, as well as how the stream should be reassembled. Note that this option constitutes a security risk if the logs are readable by unauthorized users.
--log-tcp-options
- Example: iptables -A FORWARD -p tcp -j LOG --log-tcp-options
- Explanation: The --log-tcp-options option logs the different options from the TCP packet headers and can be valuable when trying to debug what could go wrong, or what has actually gone wrong. This option does not take any variable fields or anything like that, just as most of the LOG options do not.
--log-ip-options
- Example: iptables -A FORWARD -p tcp -j LOG --log-ip-options
- Explanation: The --log-ip-options option will log most of the IP packet header options. This works exactly the same as the --log-tcp-options option, but instead works on the IP options. These log messages may be valuable when trying to debug or to track specific culprits, in just the same way as the previous option.
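
The options above are typically combined in a LOG rule that sits directly in front of the rule doing the actual work. Below is a sketch of such a pair, including a check for the 29 character limit on the prefix mentioned above. The helper name log_and_drop and the chosen prefix are made up for illustration. The script runs dry by default, i.e. it only prints the commands (without the quotes around the prefix); setting IPT=iptables as root would apply them.

```shell
#!/bin/sh
IPT="${IPT:-echo iptables}"   # dry run by default

log_and_drop() {
    # $1 = chain, $2 = log prefix (at most 29 characters)
    if [ "${#2}" -gt 29 ]; then
        echo "log prefix longer than 29 characters: $2" >&2
        return 1
    fi
    # LOG does not end rule traversal, so the packet continues to the
    # next rule after being logged.
    $IPT -A "$1" -p tcp -j LOG --log-level info --log-prefix "$2"
    $IPT -A "$1" -p tcp -j DROP
}

log_and_drop INPUT "INPUT tcp drop: "
```

While testing a new rule on a production packet filter, we could leave out the DROP line and only watch the log, as suggested earlier in this section.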

MARK target

The MARK target, next to the mark match, is used to set netfilter mark values that are associated with specific packets.

This target is only valid in the mangle table and will not work with any other table. The MARK values may be used in conjunction with the advanced routing capabilities in Linux to send different packets through different routes and to tell them to use different queue disciplines (qdisc), etc.

Note that the mark value is not set within the actual packet, but is a value that is associated with the packet within the kernel. In other words, we cannot set a MARK for a packet and then expect the MARK still to be there on another machine. If this is what we want, we will be better off with the TOS (Type of Service) target which will mangle the TOS value in the IP header.

--set-mark
- Example: iptables -t mangle -A PREROUTING -p tcp --dport 22 -j MARK --set-mark 2
- Explanation: The --set-mark option is required to set a mark. The --set-mark option takes an integer value. For example, we may set mark 2 on a specific stream of packets, or on all packets from a specific machine and then do advanced routing on that machine, to decrease or increase the network bandwidth, etc.
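
To show how a mark ties in with the advanced routing capabilities mentioned above, here is a sketch that marks SSH traffic and then sends everything carrying that mark through a separate routing table using the ip command from iproute2. The table number, gateway address and device name are assumptions made for illustration only. Both commands run dry by default and are merely printed; setting IPT=iptables and IPR=ip as root would apply them.

```shell
#!/bin/sh
IPT="${IPT:-echo iptables}"   # dry run by default
IPR="${IPR:-echo ip}"         # dry run by default

mark_and_route() {
    # $1 = mark value, $2 = routing table number, $3 = gateway, $4 = device
    $IPT -t mangle -A PREROUTING -p tcp --dport 22 -j MARK --set-mark "$1"
    # Packets carrying the mark are looked up in the given routing table ...
    $IPR rule add fwmark "$1" table "$2"
    # ... which has its own default route.
    $IPR route add default via "$3" dev "$4" table "$2"
}

mark_and_route 2 100 198.51.100.1 eth1
```

The mark itself never leaves the machine. It only connects the packet filter rule with the routing decision made on the same box.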

MASQUERADE target

The MASQUERADE target is used basically the same as the SNAT target, but it does not require any --to-source option. The reason for this is that the MASQUERADE target was made to work with, for example, dial-up connections, or DHCP (Dynamic Host Configuration Protocol) connections, which get dynamic IP addresses when connecting to the network in question.

This means that we should only use the MASQUERADE target with dynamically assigned IP connections, which we do not know the actual address of at all times. If we have a static IP connection, we should instead use the SNAT target.

When we masquerade a connection, we do not specify the source address with a --to-source option. Instead, the IP address is automatically grabbed from the network interface the packet leaves through.

The MASQUERADE target also has the effect that connections are forgotten when an interface goes down, which is extremely good if we, for example, kill a specific interface.

If we had used the SNAT target, we might have been left with a lot of old connection tracking data, which would be lying around for days, swallowing up useful connection tracking memory. Forgetting the connections is, in general, the correct behavior when dealing with dial-up lines that are probably assigned a different IP every time they are brought up. In case we are assigned a different IP, the connection is lost anyway, and it is unnecessary to keep the entry around.

It is still possible to use the MASQUERADE target instead of SNAT even though we do have a static IP, however, it is not favorable since it will add extra overhead, and there may be inconsistencies in the future which will thwart our existing scripts and render them unusable.

Note that the MASQUERADE target is only valid within the POSTROUTING chain in the nat table, just as the SNAT target. The MASQUERADE target takes one option specified below, which is optional.

--to-ports
- Example: iptables -t nat -A POSTROUTING -p TCP -j MASQUERADE --to-ports 1024-31000
- Explanation: The --to-ports option is used to set the source port or ports to use on outgoing packets. Either we can specify a single port like --to-ports 1025 or we may specify a port range as --to-ports 1024-3000. In other words, the lower port range delimiter and the upper port range delimiter separated with a hyphen. This alters the default SNAT port-selection as described in the SNAT target section. The --to-ports option is only valid if the rule match section specifies the TCP or UDP protocols with the --protocol match.

MIRROR target

Be warned, the MIRROR is dangerous and was only developed as an example code of the new conntrack and NAT code. It can cause dangerous things to happen, and very serious DoS (Denial of Service) attacks will be possible if used improperly. It was removed from 2.5 and 2.6 kernels due to its bad security implications!

The MIRROR target is an experimental and demonstration target only, and we are warned against using it, since it may result in really bad loops hence, among other things, resulting in serious DoS.

The MIRROR target is used to invert the source and destination fields in the IP header, and then to retransmit the packet. This can cause some really funny effects, and I will bet that, thanks to this target, not just one red faced cracker has cracked his own box by now.

The effect of using this target is stark, to say the least. Let's say we set up a MIRROR target for port 80 at computer A. If machine B were to come from yahoo.com, and try to access the HTTP server at machine A, the MIRROR target would return the yahoo machine's own web page (since this is where the request came from).

Note that the MIRROR target is only valid within the INPUT, FORWARD and PREROUTING chains, and any user-specified chains which are called from those chains.

Also note that outgoing packets resulting from the MIRROR target are not seen by any of the normal chains in the filter, nat or mangle tables, which could give rise to loops and other problems. This could make the target the cause of unforeseen headaches.

For example, a machine might send a spoofed packet to another machine that uses the MIRROR command with a TTL (Time to Live) of 255, at the same time spoofing its own packet, so as to seem as if it comes from a third machine that uses the MIRROR command. The packet will then bounce back and forth incessantly, for the number of hops there are to be completed.

If there is only 1 hop, the packet will jump back and forth 240-255 times. Not bad for a cracker, in other words, to send 1500 bytes of data and eat up 380 kbyte of our connection. Note that this is a best case scenario for the cracker or script kiddie, whatever we want to call them.
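
The numbers above are easy to verify. This small sketch multiplies the packet size by the number of bounces; the function name is made up for illustration.

```shell
# Total traffic caused by one packet bouncing between two MIRROR hosts.
bounce_traffic() {
    # $1 = packet size in bytes, $2 = number of bounces
    echo $(( $1 * $2 ))
}

bounce_traffic 1500 255   # prints 382500, i.e. roughly 380 kbyte
bounce_traffic 1500 240   # prints 360000
```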

NETMAP target

NETMAP is a new implementation of the SNAT and DNAT targets where the host part of the IP address is not changed. It provides a 1:1 NAT function for whole networks which is not available in the standard SNAT and DNAT functions.

For example, let's say we have a network containing 254 machines using private IP addresses (a /24 network), and we just got a new /24 network of public IPs. Instead of walking around and changing the IP of each and every one of the machines, we would be able to simply use the NETMAP target like -j NETMAP --to 10.5.6.0/24 and voilà, all the machines are seen as 10.5.6.x when they leave the packet filter. For example, 192.168.1.26 would become 10.5.6.26.

--to
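- Example: iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -j NETMAP --to 10.5.6.0/24
- Explanation: This is the only option of the NETMAP target. In the above example, the 192.168.1.x machines will be directly translated into 10.5.6.x. Note that the NETMAP target is only valid in the nat table. Used in the POSTROUTING chain it rewrites the source address, while in the PREROUTING chain it rewrites the destination address.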
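
The 1:1 mapping can be illustrated with a bit of shell arithmetic: the network part is taken from the new network, the host part is kept from the original address. This is only a sketch of the address calculation, not of the kernel code, and the function names are made up for illustration. It assumes dotted-quad input without leading zeros.

```shell
#!/bin/sh
ip_to_int() {
    # Convert dotted-quad a.b.c.d into a 32-bit integer
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

int_to_ip() {
    # Convert a 32-bit integer back into dotted-quad notation
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

netmap() {
    # $1 = original address, $2 = new network, $3 = prefix length
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(( (0xffffffff << (32 - $3)) & 0xffffffff ))
    # network bits from the new network, host bits from the old address
    int_to_ip $(( (net & mask) | (addr & ~mask & 0xffffffff) ))
}

netmap 192.168.1.26 10.5.6.0 24   # prints 10.5.6.26
```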

NFQUEUE target

The NFQUEUE target is used much the same way as the QUEUE target, and is basically an extension of it. The NFQUEUE target allows for sending packets for separate and specific queues. The queue is identified by a 16-bit ID. This target requires the nfnetlink_queue kernel support to run.

--queue-num
- Example: iptables -t nat -A PREROUTING -p tcp --dport 80 -j NFQUEUE --queue-num 30
- Explanation: The --queue-num option specifies which queue to use and to send the queued data to. If this option is skipped, the default queue 0 is used. The queue number is a 16 bit unsigned integer, which means it can take any value between 0 and 65535. The default 0 queue is also used by the QUEUE target.

NOTRACK target

This target is used to turn off connection tracking for all packets matching this rule. The target is only valid inside the raw table.

The target takes no options and is very easy to use. Match the packets we wish to not track, and then set the NOTRACK target on the rules matching the packets we do not wish to track.

QUEUE target

The QUEUE target is used to queue packets to userspace programs and applications. It is used in conjunction with programs or utilities that are extraneous to netfilter and may be used, for example, with network accounting, or for specific and advanced applications which proxy or filter packets.

As of kernel 2.6.14 the behavior of netfilter has changed. A new system for talking to the QUEUE has been devised, called nfnetlink_queue. The QUEUE target is basically a pointer to NFQUEUE 0 nowadays. For programming questions take a look at the nfnetlink_queue.ko module.

REDIRECT target

The REDIRECT target is used to redirect packets and streams to the machine itself.

This means that we could for example REDIRECT all packets destined for the HTTP ports to an HTTP proxy like squid, on our own machine. Locally generated packets are mapped to the 127.0.0.1 address.

In other words, this rewrites the destination address to our own machine for packets that are forwarded, or something alike. The REDIRECT target is extremely good to use when we want, for example, transparent proxying, where the LAN machines do not know about the proxy at all.

Note that the REDIRECT target is only valid within the PREROUTING and OUTPUT chains of the nat table. It is also valid within user-specified chains that are only called from those chains, and nowhere else. The REDIRECT target takes only one option, as described below:

--to-ports
- Example: iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080
- Explanation: The --to-ports option specifies the destination port, or port range, to use. Without the --to-ports option, the destination port is never altered. This is specified, as above, --to-ports 8080 in case we only want to specify one port. If we wanted to specify a port range, we would do it like --to-ports 8080-8090, which tells the REDIRECT target to redirect the packets to the ports 8080 through 8090. Note that this option is only available in rules specifying the TCP or UDP protocol with the --protocol match, since it would not make any sense anywhere else.

REJECT target

The REJECT target works basically the same as the DROP target, but it also sends back an error message to the machine sending the packet that was blocked.

The REJECT target is as of today only valid in the INPUT, FORWARD and OUTPUT chains or their subchains. After all, these would be the only chains in which it would make any sense to put this target. Note that all chains that use the REJECT target may only be called by the INPUT, FORWARD, and OUTPUT chains, else they will not work.

There is currently only one option which controls the nature of how this target works, though this may in turn take a huge set of variables. Most of them are fairly easy to understand, if we have a basic knowledge of TCP/IP:

--reject-with
- Example: iptables -A FORWARD -p TCP --dport 22 -j REJECT --reject-with tcp-reset
- Explanation: This option tells the REJECT target what response to send to the machine that sent the packet that we are rejecting. Once we get a packet that matches a rule in which we have specified this target, our machine will first of all send the associated reply, and the packet will then be dropped dead, just as the DROP target would drop it. The following reject types are currently valid: icmp-net-unreachable, icmp-host-unreachable, icmp-port-unreachable, icmp-proto-unreachable, icmp-net-prohibited and icmp-host-prohibited. The default error message is to send an icmp-port-unreachable to the machine. All of the above are ICMP error messages and may be set as we wish. Finally, there is one more option called tcp-reset, which may only be used together with the TCP protocol. The tcp-reset option will tell REJECT to send a TCP RST packet in reply to the sending machine. A TCP RST packet aborts an open TCP connection immediately, rather than closing it gracefully. As stated in the iptables man page, this is mainly useful for blocking ident probes which frequently occur when sending mail to broken mail hosts, that will not otherwise accept our mail.
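
Since tcp-reset may only be used together with the TCP protocol, a script that builds REJECT rules for several protocols has to pick the reject type accordingly. A minimal sketch of that choice follows; the helper name reject_rule is made up for illustration. The script runs dry by default and only prints the commands; setting IPT=iptables as root would apply them.

```shell
#!/bin/sh
IPT="${IPT:-echo iptables}"   # dry run by default

reject_rule() {
    # $1 = protocol (tcp or udp), $2 = destination port
    if [ "$1" = "tcp" ]; then
        how="tcp-reset"               # only valid together with -p tcp
    else
        how="icmp-port-unreachable"   # the default ICMP error message
    fi
    $IPT -A FORWARD -p "$1" --dport "$2" -j REJECT --reject-with "$how"
}

reject_rule tcp 22
reject_rule udp 53
```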

RETURN target

The RETURN target will cause the current packet to stop traveling through the chain where it hit the rule.

If it is the subchain of another chain, the packet will continue to travel through the superior chains as if nothing had happened. If the chain is the main chain, for example the INPUT chain, the packet will have the default policy taken on it. The default policy is normally set to ACCEPT, DROP or similar.

For example, let us say a packet enters the INPUT chain and then hits a rule that it matches and that tells it to --jump EXAMPLE_CHAIN. The packet will then start traversing the EXAMPLE_CHAIN, and all of a sudden it matches a specific rule which has the --jump RETURN target set.

It will then jump back to the INPUT chain. Another example would be if the packet hits a --jump RETURN rule in the INPUT chain. It would then be dropped to the default policy as previously described, and no more actions would be taken in this chain.
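
The first case described above can be sketched as a small rule set. The source address and the ports used here are assumptions made for illustration. The script runs dry by default and only prints the commands; setting IPT=iptables as root would apply them.

```shell
#!/bin/sh
IPT="${IPT:-echo iptables}"   # dry run by default

return_demo() {
    # $1 = source address that should skip the rest of EXAMPLE_CHAIN
    $IPT -N EXAMPLE_CHAIN
    $IPT -A INPUT -p tcp --jump EXAMPLE_CHAIN
    # Packets from $1 leave EXAMPLE_CHAIN here ...
    $IPT -A EXAMPLE_CHAIN -s "$1" --jump RETURN
    # ... and therefore never reach this rule.
    $IPT -A EXAMPLE_CHAIN -p tcp --dport 23 --jump DROP
    # Traversal of INPUT continues with the rule after the jump.
    $IPT -A INPUT -p tcp --dport 22 --jump ACCEPT
}

return_demo 192.168.1.20
```

A packet from 192.168.1.20 enters EXAMPLE_CHAIN, hits the RETURN rule and continues in the INPUT chain with the rule that follows the jump. All other packets are checked against the remaining rules of EXAMPLE_CHAIN first.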

SAME target

The SAME target works almost in the same fashion as the SNAT target, but it still differs. Basically, the SAME target will try to always use the same outgoing IP address for all connections initiated by a single machine on our network.

For example, say we have one 192.168.1.0/24 network and 3 IP addresses 10.5.6.7-9. Now, if 192.168.1.20 went out through 10.5.6.7 address the first time, the packet filter will try to keep that machine always going out through that IP address.

--to
- Example: iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -j SAME --to 10.5.6.7-10.5.6.9
- Explanation: As we can see, the --to argument takes 2 IP addresses bound together by a - sign. These IP addresses, and all in between, are the IP addresses that we NAT to using the SAME algorithm. Note that, like SNAT, the SAME target belongs in the nat table.
--nodst
- Example: iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -j SAME --to 10.5.6.7-10.5.6.9 --nodst
- Explanation: Under normal action, the SAME target is calculating the followup connections based on both destination and source IP addresses. Using the --nodst option, it uses only the source IP address to find out which outgoing IP the NAT function should use for the specific connection. Without this argument, it uses a combination of the destination and source IP address.

SECMARK target

The SECMARK target is used to set a security context mark on a single packet, as defined by SELinux and security systems. The SECMARK target is only valid in the mangle table.

In brief, SELinux is a new and improved security system to add MAC (Mandatory Access Control) to Linux, implemented by the NSA as a proof of concept. SELinux basically sets security attributes for different objects and then matches them into security contexts. The SECMARK target is used to set a security context on a packet which can then be used within the security subsystems to match on.

--selctx
- Example: iptables -t mangle -A PREROUTING -p tcp --dport 80 -j SECMARK --selctx httpcontext
- Explanation: The --selctx option is used to specify which security context to set on a packet. The context can then be used for matching inside the security systems of Linux.

SNAT target

The SNAT (Source Network Address Translation) target is used to do SNAT, which means that this target will rewrite the source IP address in the IP header of the IP packet.

This is what we want, for example, when several machines have to share an Internet connection. We can then turn on ip forwarding in the kernel, and write an SNAT rule which will translate all packets going out from our local network to the source IP of our own Internet connection.

Without doing this, the outside world would not know where to send reply packets, since our local networks mostly use the IP addresses which IANA (Internet Assigned Numbers Authority) has reserved for private LAN use (the RFC 1918 ranges) and which are not routed on the Internet.

If we forwarded these packets as is, no one on the Internet would know that they were actually from us. The SNAT target does all the translation needed to do this kind of work, letting all packets leaving our LAN look as if they came from a single machine, which would be our packet filter.

The SNAT target is only valid within the nat table, within the POSTROUTING chain i.e. this is the only chain in which we may use SNAT.

Only the first packet in a connection actually hits the SNAT rule. After that, all future packets using the same connection are SNATted in the same way automatically, i.e. the decision made by the initial rules in the POSTROUTING chain is applied to all the packets in the same stream.
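The connection-sharing setup described above could be sketched roughly as follows. This is only a minimal sketch: the LAN range, interface name and public address are placeholder assumptions (203.0.113.10 is a documentation address), and the run helper is a hypothetical dry-run wrapper that prints each command instead of executing it unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from a real setup).
LAN="192.168.1.0/24"
WAN_IF="eth0"
PUBLIC_IP="203.0.113.10"

# Hypothetical dry-run helper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

snat_gateway() {
    # Let the kernel forward packets between interfaces.
    run sysctl -w net.ipv4.ip_forward=1
    # Rewrite the source address of everything leaving the LAN via WAN_IF.
    run iptables -t nat -A POSTROUTING -s "$LAN" -o "$WAN_IF" -j SNAT --to-source "$PUBLIC_IP"
}

snat_gateway
```

Run as is, the script only prints the two commands. Actually applying them requires root and DRY_RUN=0.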

--to-source
- Example: iptables -t nat -A POSTROUTING -p tcp -o eth0 -j SNAT --to-source 194.236.50.155-194.236.50.160:1024-32000
- Explanation: The --to-source option is used to specify which source the IP packet should use. This option, at its simplest, takes one IP address which we want to use for the source IP address in the IP header. If we want to balance between several IP addresses, we can use a range of IP addresses, separated by a hyphen. The --to-source IP numbers could then, for instance, be something like in the above example: 194.236.50.155-194.236.50.160. The source IP for each stream that we open would then be allocated randomly from these, and a single stream would always use the same IP address for all packets within that stream. We can also specify a range of ports to be used by SNAT. All the source ports would then be confined to the ports specified. The port part of the rule would then look like in the example above, :1024-32000. This is only valid if -p tcp or -p udp was specified somewhere in the match of the rule in question. netfilter will always try to avoid making any port alterations if possible, but if two machines try to use the same ports, then netfilter will map one of them to another port. If no port range is specified, then if they are needed, all source ports below 512 will be mapped to other ports below 512. Those between source ports 512 and 1023 will be mapped to ports below 1024. All other ports will be mapped to 1024 or above. As previously stated, iptables will always try to maintain the source ports used by the actual workstation making the connection. Note that this has nothing to do with destination ports, so if a client tries to make contact with an HTTP server outside the packet filter, it will not be mapped to the FTP control port.
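The port classes described in the explanation above can be written down as a tiny lookup. This is only an illustration of the rule as stated in the text, not of the kernel code itself, and the function name snat_port_class is made up for this sketch.

```shell
#!/bin/sh
# Print the range a replacement source port is taken from when no
# explicit port range was given to --to-source (as described above).
snat_port_class() {
    port="$1"
    if [ "$port" -lt 512 ]; then
        echo "below 512"
    elif [ "$port" -lt 1024 ]; then
        echo "below 1024"
    else
        echo "1024 or above"
    fi
}

snat_port_class 22      # below 512
snat_port_class 631     # below 1024
snat_port_class 40000   # 1024 or above
```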

TCPMSS target

The TCPMSS target (there is also the match) can be used to alter the MSS (Maximum Segment Size) value of TCP SYN packets that the packet filter sees.

The MSS value is used to control the maximum size of packets for specific connections. Under normal circumstances, this means the size of the MTU (Maximum Transmission Unit) value, minus 40 bytes (20 bytes of IP header plus 20 bytes of TCP header). This is used to overcome some ISPs and servers that block ICMP Fragmentation Needed packets, which can result in really weird problems, mainly described such that everything works perfectly from our packet filter/router, but our local machines behind the packet filter cannot exchange large packets.

This could mean such things as mail servers being able to send small mails, but not large ones, web browsers that connect but then hang with no data received, and SSH connecting properly, but SCP hangs after the initial handshake. In other words, everything that uses any large packets will be unable to work.

The TCPMSS target is able to solve these problems, by changing the size of the packets going out through a connection. Please note that we only need to set the MSS on the SYN packet since the machines take care of the MSS after that. The target takes two arguments:

--set-mss
- Example: iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o eth0 -j TCPMSS --set-mss 1460
- Explanation: The --set-mss argument explicitly sets a specific MSS value of all outgoing packets. In the example above, we set the MSS of all SYN packets going out over the eth0 interface to 1460 bytes — normal MTU for ethernet is 1500 bytes, minus 40 bytes is 1460 bytes. MSS only has to be set properly in the SYN packet, and then the peer machines take care of the MSS automatically.
--clamp-mss-to-pmtu
- Example: iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o eth0 -j TCPMSS --clamp-mss-to-pmtu
- Explanation: The --clamp-mss-to-pmtu automatically sets the MSS to the proper value, hence we do not need to explicitly set it. It is automatically set to PMTU (Path Maximum Transmission Unit) minus 40 bytes, which should be a reasonable value for most applications.
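The "minus 40 bytes" arithmetic is easy to check by hand. A minimal sketch, assuming IPv4 and a TCP header without options (the helper name mss_for_mtu is made up here, and 1492 is the MTU usually seen on PPPoE links):

```shell
#!/bin/sh
# MSS = MTU - 20 bytes IPv4 header - 20 bytes TCP header (no options).
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

mss_for_mtu 1500   # plain Ethernet: 1460
mss_for_mtu 1492   # PPPoE link: 1452
```

The second value is the one we would pass to --set-mss on a PPPoE uplink if we did not want to rely on --clamp-mss-to-pmtu.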

TOS target

The TOS (Type of Service) target is used to set the type of service field within the IP header. Note that this target is only valid within the mangle table.

The TOS field consists of 8 bits which are used to help in routing packets. This is one of the fields that can be used directly within iproute2 and its subsystem for routing policies. Worth noting, is that if we handle several separate packet filters and routers, this is the only way to propagate routing information within the actual packet between these routers and packet filters.

As previously noted, the MARK target (which sets a MARK associated with a specific packet) is only available within the kernel, and cannot be propagated with the packet. If we feel a need to propagate routing information for a specific packet or stream, we should therefore set the TOS field, which was developed for this.

There are currently a lot of routers on the Internet which do a pretty bad job at this, so as of now it may prove to be a bit useless to attempt TOS mangling before sending the packets on to the Internet. At best the routers will not pay any attention to the TOS field. At worst, they will look at the TOS field and do the wrong thing.

However, as stated above, the TOS field can most definitely be put to good use if we have a large WAN (Wide Area Network) or LAN (Local Area Network) with multiple routers. We then in fact have the possibility of giving packets different routes and preferences, based on their TOS value — even though this might be confined to our own network.

The TOS target is only capable of setting specific values, or named values on packets. These predefined TOS values can be found in the kernel include files, or more precisely, the ../linux/ip.h file.

The reasons are many, and we should actually never need to set any other values. However, there are ways around this limitation. To get around the limitation of only being able to set the named values on packets, we can use the FTOS feature/patch available at the Paksecured Linux Kernel patches site. However, we should be cautious with this patch i.e. we should not need to use any other than the default values, except in extreme cases.

The TOS target only takes one option as described below.

--set-tos
- Example: iptables -t mangle -A PREROUTING -p TCP --dport 22 -j TOS --set-tos 0x10
- Explanation: The --set-tos option tells the TOS mangler what TOS value to set on packets that are matched. The option takes a numeric value, either in hex or in decimal. As the TOS value consists of 8 bits, the value may be 0-255, or in hex 0x00-0xFF. Note that in the standard TOS target we are limited to using the named values available (which should be more or less standardized), as mentioned above. These values are Minimize-Delay (decimal value 16, hex value 0x10), Maximize-Throughput (decimal value 8, hex value 0x08), Maximize-Reliability (decimal value 4, hex value 0x04), Minimize-Cost (decimal value 2, hex value 0x02) or Normal-Service (decimal value 0, hex value 0x00). The default value on most packets is Normal-Service, or 0. Note that we can, of course, use the actual names instead of the hex values to set the TOS value; in fact, this is generally to be recommended, since the values associated with the names may be changed in future. For a complete listing of the descriptive values we can do an iptables -j TOS -h.
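For quick reference, the named values listed above can be kept in a small lookup function. This is merely a convenience sketch of the table from the explanation. The function name tos_value is made up, and iptables itself accepts the names directly.

```shell
#!/bin/sh
# Map a TOS name (as listed above) to its hex value.
tos_value() {
    case "$1" in
        Minimize-Delay)       echo 0x10 ;;
        Maximize-Throughput)  echo 0x08 ;;
        Maximize-Reliability) echo 0x04 ;;
        Minimize-Cost)        echo 0x02 ;;
        Normal-Service)       echo 0x00 ;;
        *) echo "unknown TOS name: $1" >&2; return 1 ;;
    esac
}

# Print name, hex and decimal value for every known name.
for name in Minimize-Delay Maximize-Throughput Maximize-Reliability Minimize-Cost Normal-Service; do
    hex="$(tos_value "$name")"
    echo "$name $hex $(( hex ))"
done
```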

TTL target

The TTL (Time to Live) target is used to modify the time to live field in the IP header. The TTL target is only valid within the mangle table.

One useful application of this is to change all TTL values to the same value on all outgoing packets. One reason for doing this is if we have a bully ISP which does not allow us to have more than one machine connected to the same Internet connection, and who actively pursues this.

Setting all TTL values to the same value, will effectively make it a little bit harder for them to notice that we are doing this. We may then reset the TTL value for all outgoing packets to a standardized value, such as 64 as specified in the Linux kernel. It takes 3 options as of writing this, all of them described below:

--ttl-set
- Example: iptables -t mangle -A PREROUTING -i eth0 -j TTL --ttl-set 64
- Explanation: The --ttl-set option tells the TTL target which TTL value to set on the packet in question. A good value would be somewhere around 64. It is not too long, and it is not too short. Do not set this value too high, since it may affect our network, and it is a bit immoral to set this value too high, since the packet may start bouncing back and forth between two badly configured routers, and the higher the TTL, the more bandwidth will be eaten unnecessarily in such a case. This target could be used to limit how far away our clients are. A good case of this could be DNS servers, where we do not want the clients to be too far away.
--ttl-dec
- Example: iptables -t mangle -A PREROUTING -i eth0 -j TTL --ttl-dec 1
- Explanation: The --ttl-dec option tells the TTL target to decrement the time to live value by the amount specified after the --ttl-dec option. In other words, if the TTL for an incoming packet was 53 and we had set --ttl-dec 3, the packet would leave our machine with a TTL value of 49. The reason for this is that the networking code will automatically decrement the TTL value by 1, hence the packet will be decremented by 4 steps, from 53 to 49. This could for example be used when we want to limit how far away the people using our services are. For example, users should always use a close-by DNS, and hence we could match all packets leaving our DNS server and then decrease it by several steps. Of course, the --ttl-set option may be a better idea for this usage.
--ttl-inc
- Example: iptables -t mangle -A PREROUTING -i eth0 -j TTL --ttl-inc 1
- Explanation: The --ttl-inc option tells the TTL target to increment the time to live value with the value specified to the --ttl-inc option. This means that we should raise the TTL value with the value specified in the --ttl-inc option, and if we specified --ttl-inc 4, a packet entering with a TTL of 53 would leave the machine with TTL 56. Note that the same thing goes here, as for the previous example of the --ttl-dec option, where the network code will automatically decrement the TTL value by 1, which it always does. This may be used to make our packet filter a bit more stealthy to trace-routes among other things. By setting the TTL one value higher for all incoming packets, we effectively make the packet filter hidden from trace-routes. Trace-routes are a loved and hated thing, since they provide excellent information on problems with connections and where it happens, but at the same time, it gives the hacker/cracker some good information about our upstreams if they have targeted us.
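The TTL arithmetic from the two explanations above, including the automatic decrement by 1 done by the networking code when a packet is forwarded, can be checked with a few lines of shell. The function names are made up for this sketch.

```shell
#!/bin/sh
# The networking code takes 1 off the TTL of every forwarded packet.
FORWARD_DECREMENT=1

# ttl_after_dec INCOMING_TTL AMOUNT  (models --ttl-dec AMOUNT)
ttl_after_dec() {
    echo $(( $1 - $2 - $FORWARD_DECREMENT ))
}

# ttl_after_inc INCOMING_TTL AMOUNT  (models --ttl-inc AMOUNT)
ttl_after_inc() {
    echo $(( $1 + $2 - $FORWARD_DECREMENT ))
}

ttl_after_dec 53 3   # 49
ttl_after_inc 53 4   # 56
ttl_after_inc 53 1   # 53
```

The last call shows why --ttl-inc 1 hides the packet filter from trace-routes: the packet leaves with the same TTL it arrived with, so the hop never shows up.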

ULOG target

The ULOG target is used to provide userspace logging of matching packets.

If a packet is matched and the ULOG target is set, the packet information is multicast together with the whole packet through a netlink socket. One or more userspace processes may then subscribe to various multicast groups and receive the packet. In other words, this is a more complete and more sophisticated logging facility than the standard LOG target. So far it is only used by iptables and netfilter.

This target enables us to log information to MySQL databases, and other databases, making it much simpler to search for specific packets, and to group log entries. We can find the ULOGD userland applications at the ULOGD project page.

--ulog-nlgroup
- Example: iptables -A INPUT -p TCP --dport 22 -j ULOG --ulog-nlgroup 2
- Explanation: The --ulog-nlgroup option tells the ULOG target which netlink group to send the packet to. There are 32 netlink groups, which are simply specified as 1-32. If we would like to reach netlink group 5, we would simply write --ulog-nlgroup 5. The default netlink group used is 1.
--ulog-prefix
- Example: iptables -A INPUT -p TCP --dport 22 -j ULOG --ulog-prefix "SSH connection attempt: "
- Explanation: The --ulog-prefix option works just the same as the prefix value for the standard LOG target. This option prefixes all log entries with a user-specified log prefix. It can be 32 characters long, and is definitely most useful to distinguish different log-messages and where they came from.
--ulog-cprange
- Example: iptables -A INPUT -p TCP --dport 22 -j ULOG --ulog-cprange 100
- Explanation: The --ulog-cprange option tells the ULOG target how many bytes of the packet to send to the userspace daemon of ULOG. If we specify 100 as above, we would copy 100 bytes of the whole packet to userspace, which would hopefully include the whole header, plus some leading data within the actual packet. If we specify 0, the whole packet will be copied to userspace, regardless of the packet's size. The default value is 0, so the whole packet will be copied to userspace.
--ulog-qthreshold
- Example: iptables -A INPUT -p TCP --dport 22 -j ULOG --ulog-qthreshold 10
- Explanation: The --ulog-qthreshold option tells the ULOG target how many packets to queue inside the kernel before actually sending the data to userspace. For example, if we set the threshold to 10 as above, the kernel would first accumulate 10 packets inside the kernel, and then transmit them to userspace as one single netlink multi-part message. The default value here is 1 for backward compatibility, since the userspace daemon previously did not know how to handle multi-part messages.
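The four options can of course be combined in one rule. The following sketch only builds and prints such a rule after checking the two limits mentioned above (netlink groups 1-32, prefix at most 32 characters). It assumes a kernel that still provides the ULOG target, and the function name ulog_rule is made up for illustration.

```shell
#!/bin/sh
# ulog_rule NLGROUP PREFIX: print a ULOG rule for incoming SSH traffic,
# refusing values outside the limits described in the text.
ulog_rule() {
    group="$1"
    prefix="$2"
    if [ "$group" -lt 1 ] || [ "$group" -gt 32 ]; then
        echo "netlink group must be 1-32, got $group" >&2
        return 1
    fi
    if [ "${#prefix}" -gt 32 ]; then
        echo "prefix is longer than 32 characters" >&2
        return 1
    fi
    printf 'iptables -A INPUT -p tcp --dport 22 -j ULOG --ulog-nlgroup %s --ulog-prefix "%s" --ulog-cprange 100 --ulog-qthreshold 10\n' "$group" "$prefix"
}

ulog_rule 2 "SSH connection attempt: "
```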

Network Address Translation

NAT (Network Address Translation) is one of the biggest attractions of Linux and netfilter to this day it seems. Instead of using fairly expensive third party solutions from Juniper/Cisco/etc., a lot of companies and individuals have chosen to go with netfilter instead.

One of the main reasons is that it is cheap, and secure (no blackbox because it is FLOSS (Free/Libre Open Source Software)). All it requires is a piece of hardware appropriate to the planned use case, a fairly new Linux kernel which we can download for free from the Internet, one or two NICs (Network Interface Cards) and cabling.

NAT Use Cases and Terms

Basically, NAT allows a machine or several machines to share the same IP address. For example, let us say we have a LAN consisting of 5-10 clients. We set their default gateways to point through the NAT server. Normally the packet would simply be forwarded by the gateway machine, but in the case of a NAT server it is a little bit different.

NAT servers translate the source and/or destination addresses of IP packets to different addresses. The NAT server receives the packet, rewrites the source and/or destination address and then recalculates the checksum of the packet.

One of the most common usages of NAT is the SNAT (Source Network Address Translation) function. Basically, this is used when we have only one public IP address but several machines (note that those may as well be virtual machines, or real physical ones or even a mixture of the both) within our LAN which we want to connect to the Internet, plus, we cannot afford or see any real benefit in having a public IP for each and every one of those machines within our LAN.

In that case, we use one of the private IP address ranges for our LAN e.g. 192.168.1.0/24, and then turn on SNAT. The packet filter will then use SNAT and translate all 192.168.1.0/24 addresses into its own public IP address i.e. rewrite the source IP address for each outgoing IP packet to for example 145.115.95.34. This way, there will be 5-10 clients or many many more using the same shared IP address.

There is also something called DNAT (Destination Network Address Translation), which can be extremely helpful when it comes to setting up servers etc.

First of all, we can help the greater good when it comes to saving IP space. Second, we can get a more or less totally impenetrable packet filter in between LAN internal machines and any outside net e.g. the Internet, and/or simply share one IP address among services that are spread over several physically different machines.

For example, we may run a small company server farm containing a httpd and ftpd on the same physical machine (e.g. by using OpenVZ) while there is a second physically separated machine containing a couple of different IM (Instant Messaging) services that the employees working from home or on the road can use to keep in touch with the employees that are on-site.

We may then run all of these services on the same IP address from the outside via DNAT. The above example is also based on separate port NAT'ing, or often called PNAT (Port Network Address Translation). We do not refer to this very often, since it is covered by the DNAT and SNAT functionality in netfilter anyway.
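As a rough sketch of the server farm example, the following prints DNAT rules that publish services from two internal machines behind one public address. All addresses are placeholders, port 5222 (XMPP) merely stands in for "some IM service", and run is a hypothetical dry-run helper that prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder addresses (assumptions for illustration only).
PUBLIC_IP="203.0.113.10"
WEB_FTP_HOST="192.168.1.10"   # machine running httpd and ftpd
IM_HOST="192.168.1.20"        # machine running the IM services

# Hypothetical dry-run helper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# dnat PORT INTERNAL_HOST: forward one TCP port of the public IP.
dnat() {
    run iptables -t nat -A PREROUTING -d "$PUBLIC_IP" -p tcp --dport "$1" -j DNAT --to-destination "$2:$1"
}

publish_services() {
    dnat 80 "$WEB_FTP_HOST"
    dnat 21 "$WEB_FTP_HOST"
    dnat 5222 "$IM_HOST"
}

publish_services
```

Note that FTP is one of the complex protocols mentioned in this section: the rule for port 21 only covers the control connection, so the data connections additionally need netfilter's FTP connection tracking and NAT helpers.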

In Linux, there are actually two separate types of NAT that can be used:

- fast-NAT
- netfilter-NAT

Fast-NAT is implemented inside the IP routing code of the Linux kernel, while netfilter-NAT is also implemented in the Linux kernel, but inside the netfilter code.

Fast-NAT is generally called by this name since it is much faster than the netfilter NAT code. It does not keep track of connections, and this is both its main pro and con.

Connection tracking takes a lot of processor power, and hence it is slower, which is one of the main reasons that fast-NAT is faster than netfilter-NAT. As we also said, the bad thing about fast-NAT is that it does not track connections, which means it will not be able to do SNAT very well for whole networks, nor will it be able to NAT complex protocols such as FTP, IRC and other protocols that netfilter-NAT is able to handle very well. It is possible, but it will take much, much more work than would be expected from the netfilter implementation.

There is also a final word that is basically a synonym to SNAT, which is the masquerade word. In netfilter, masquerade is pretty much the same as SNAT with the exception that masquerading will automatically set the new source IP to the default IP address of the outgoing network interface.

Caveats using NAT

As we have already explained to some extent, there are quite a lot of minor caveats with using NAT. The main problem is that certain protocols and applications may not work at all within some NAT setup. Hopefully, these applications are not too common and even if they happen to be present, it should always be possible to segregate them into some non-NAT environment.

The second and smaller problem is applications and protocols which will only work partially. These protocols are more common than the ones that will not work at all, which is quite unfortunate, but there is not very much we can do about it as it seems. If complex protocols continue to be built, this is a problem we will have to continue living with. Especially if the protocols are not standardized like for example Skype, ICQ etc.

The third, and largest problem is the fact that the user who sits behind a NAT server to get out on the internet will not be able to run his own server.

It could be done, of course, but it takes a lot more time and work to set this up. In companies, this is probably preferred over having tons of servers run by different employees that are reachable from the Internet, without any supervision.

However, when it comes to home users, this should be avoided to the very last. We should never as an Internet service provider NAT our customers from a private IP range to a public IP. It will cause us more trouble than it is worth having to deal with, and there will always be one or another client which will want this or that protocol to work flawlessly. When it does not, we will be called down upon.

As one last note on the caveats of NAT, it should be mentioned that NAT is actually just a hack more or less. NAT was a solution that was worked out when the IANA and other organisations noted that the Internet grew exponentially, and that IP addresses would soon be in short supply.

NAT was and is a short term solution to the address shortage problem with IPv4 — the long term solution to the IPv4 address shortage is the IPv6 protocol, which also solves a ton of other problems.

IPv6 uses 128 bits for its addresses, while IPv4 only uses 32 bits. This is an incredible increase in address space.

Example NAT machine in theory

This is a small theoretical scenario where we want a NAT server between 2 different networks and an Internet connection.

What we want to do is to connect 2 networks to each other, and both networks should have access to each other and the Internet. We will discuss the hardware questions we should take into consideration, as well as other theory we should think about before actually starting to implement the NAT machine.

What is needed to build a NAT Machine

Before we discuss anything further, we should start by looking at what kind of hardware is needed to build a Linux machine doing NAT.

For most smaller networks, this should be no problem, but if we are starting to look at larger networks, it can actually become one. The biggest problem with NAT is that it eats resources quite fast.

For a small private network with possibly 1-10 users, a Pentium with 256MB of RAM (Random Access Memory) will do more than enough. However, if we are starting to get up around 100 or more users, we should start considering what kind of hardware we should look at.

Of course, it is also a good idea to consider bandwidth usage, and how many connections will be open at the same time. Generally, spare computers will do very well however, and this is one of the big pros of using a Linux based packet filter. We may use old hardware that we have left over, and hence the packet filter will be very cheap in comparison to other packet filters.

My opinion however is that I never go for the cheap when it is about core IT infrastructure components. We should opt for redundancy in the system in order to increase the overall availability of the system (i.e. basically increasing the MTTF (Mean Time To Failure) by decreasing the probability of a hardware caused system outage) which can be done by using a redundant power supply, hardware RAID (Redundant Array of Independent Disks) and the like. In short, I would strongly recommend buying a decent server to do the packet filtering and NAT with which, by using virtualization, we can also do other things like for example backup our workstations and the like.

We will also need to consider NICs (Network Interface Cards). How many separate networks will connect to our NAT/filter machine? Most of the time it is simply enough to connect one LAN to the Internet.

If we connect to the Internet via Ethernet, we should generally have 2 Ethernet cards, or we use one NIC and set up virtual interfaces like for example eth0:0, eth0:1 and so forth. It might also be a good idea to choose a 1000 Mbit/s network card of a relatively good brand (e.g. Qlogic) for scalability and reliability, but mostly any kind of NIC will do as long as it has a driver in the Linux kernel.

A note on this matter: we should avoid using or getting NICs that do not have drivers in the Linux kernel. I have, on several occasions, found network cards/brands that have separately distributed drivers on discs that work dismally. They are generally not very well maintained, and if we get them to work on our kernel of choice to begin with, the chance that they will actually work on the next major Linux kernel upgrade is very small. This will most of the time mean that we may have to get a little bit more costly NIC, but in the end it is worth it.

Finally, one more thing to consider is how much RAM we put into the NAT/packet filter machine. It is a good idea to put in at least 512MB of memory if possible, even if it is possible to run it on 256MB of RAM. NAT is not extremely huge on memory consumption, but it may be wise to add as much as possible just in case we will get more traffic than expected.

As we can see, there is quite a lot to think about when it comes to hardware. But, to be completely honest, in most cases we do not need to think about these points at all, unless we are building a NAT machine for a large network or company — in which case we pick a new and decent server anyway. Most home users need not think about this, but may more or less use whatever hardware they have at hand. There are no complete comparisons and tests on this topic, but we should fare rather well with just a little bit of common sense.

Placement of NAT Machines

This should look fairly simple, however, it may be harder than we originally thought in large networks.

In general, the NAT machine should be placed on the perimeter of the network, just like any packet filtering machine out there. This, most of the time, means that the NAT and packet filtering machines are the same machine, of course. Also worth a thought, if we have very large networks, it may be worth splitting the network into smaller networks using VLANs (Virtual Local Area Networks) and assign a NAT/filtering machine for each of these networks. Since NAT takes quite a lot of processing power, this will definitely help keep RTT (Round Trip Time) down.

In our example network as we described above i.e. two LANs and an Internet connection, we should look at how large the two networks are.

If we can consider them to be small (a /25 or smaller, i.e. a prefix length of 25 or more), then, depending on what requirements the clients have, a couple of hundred clients should be no problem on a decent NAT machine.
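To put a number on such prefix lengths: the count of usable host addresses in an IPv4 network follows directly from the prefix length (all addresses in the block, minus the network and broadcast address). A small sketch with a made-up helper name:

```shell
#!/bin/sh
# usable_hosts PREFIX_LENGTH: usable IPv4 host addresses in the network.
usable_hosts() {
    echo $(( (1 << (32 - $1)) - 2 ))
}

usable_hosts 25   # 126
usable_hosts 24   # 254
```

Two /25 networks therefore give us at most 252 clients in total, which matches the "couple of hundred clients" order of magnitude.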

Otherwise, we could split up the load over several machines by setting public IPs on smaller NAT machines, each handling their own LAN segment, and then let the traffic congregate over a specific routing-only machine.

This of course presupposes that we have enough public IPs for all of our NAT machines, and that they are routed through our dedicated routing machine.

How to place Proxies

Proxies are a general problem when it comes to NAT in most cases unfortunately, especially transparent proxies.

Normal proxies should not cause too much trouble, but creating a transparent proxy is a dog to get to work, especially on larger networks. The first problem is that proxies take quite a lot of processing power, just the same as NAT does. To put both of these on the same machine is not advisable if we are going to handle large network traffic.

The second problem is that if we SNAT as well as DNAT, the proxy will not be able to know what machines to contact i.e. which server is the client trying to contact? All that information is lost during the NAT translation since the packets cannot contain that information as well if they are NAT'ed.

Locally, this has been solved by adding the information in the internal data structures that are created for the packets, and hence proxies such as squid can get the information.

As we can see, the problem is that we do not have much of a choice if we are going to run a transparent proxy. There are, of course, possibilities, but they are not advisable really.

One possibility is to create a proxy outside the packet filter and create a routing entry that routes all web traffic through that machine, and then locally on the proxy machine NAT the packets to the proper ports for the proxy. This way, the information is preserved all the way to the proxy machine and is still available on it.
The second possibility is to simply create a proxy outside the packet filter, and then block all web traffic except the traffic going to the proxy. This way, we will force all users to actually use the proxy. It is a crude way of doing it, but it will work.
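Both possibilities could look roughly like the sketch below. The proxy address is a placeholder, port 3128 is assumed because it is the usual Squid port, and run is a hypothetical dry-run helper that prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder values (assumptions for illustration only).
PROXY_IP="203.0.113.20"
PROXY_PORT="3128"

# Hypothetical dry-run helper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Possibility 1, on the proxy machine itself: hand web traffic that
# was routed to this machine over to the local proxy port.
redirect_on_proxy() {
    run iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports "$PROXY_PORT"
}

# Possibility 2, on the packet filter: let clients reach the proxy,
# but drop any web traffic that tries to bypass it.
force_proxy() {
    run iptables -A FORWARD -p tcp -d "$PROXY_IP" --dport "$PROXY_PORT" -j ACCEPT
    run iptables -A FORWARD -p tcp --dport 80 -j DROP
}

redirect_on_proxy
force_proxy
```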

The final Stage of our NAT Machine

As a final step, we should bring all of this information together, and see how we would solve the NAT machine issue.

The NAT/filtering machine has a public IP address, as do the router and any other machines that may be available on the Internet. All of the machines inside the NAT'ed networks will be using private IPs, hence saving both a lot of cash as well as IPv4 address space.

Let us take a look at a picture of the networks and how it looks. We have decided to put the proxy server at the perimeter of our LAN, just outside the NAT/filtering machine.

However, the proxy machine is still within a DMZ (Demilitarized Zone) and thus protected. The DMZ containing the proxy and possibly other machines is connected to the Internet as well as both of our LANs through the NAT/filter machine as can be seen below:

All the normal traffic from the NAT'ed networks will be sent through the DMZ directly to the router (note that this is a dedicated router i.e. not the instance used for NAT/filtering), which will send the traffic on out to the Internet. The exception is web traffic, which is instead marked inside the netfilter part of the NAT machine, and then, based on the mark, routed to the proxy machine.

Let us take a look at what we are talking about. Say an HTTP packet is seen by the NAT machine. The mangle table can then be used to mark the packet with a netfilter mark (also known as nfmark).

Later, when the packet is routed, we will be able to check for the nfmark in the routing policy rules, and based on this mark, we can choose to route the HTTP packets to the proxy machine instead of to our router. The proxy machine will then do its work.
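A minimal sketch of this mark-and-route idea, under the assumptions that the LANs sit behind eth1, that the proxy has the (made-up) DMZ address 192.168.100.2, and that mark 2 and routing table 100 are free to use. run is again a hypothetical dry-run helper that prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder values (assumptions for illustration only).
LAN_IF="eth1"
PROXY_IP="192.168.100.2"
MARK="2"
TABLE="100"

# Hypothetical dry-run helper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

mark_and_route_http() {
    # 1. Mark HTTP packets arriving from the LAN in the mangle table.
    run iptables -t mangle -A PREROUTING -i "$LAN_IF" -p tcp --dport 80 -j MARK --set-mark "$MARK"
    # 2. Packets carrying that mark consult a separate routing table ...
    run ip rule add fwmark "$MARK" table "$TABLE"
    # 3. ... whose only route points at the proxy machine.
    run ip route add default via "$PROXY_IP" table "$TABLE"
}

mark_and_route_http
```

Matching on the incoming LAN interface keeps the proxy's own outgoing requests from being marked and routed back to itself.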

SNAT

netstat-nat
#netmap_target

Masquerading

The MASQUERADE target is used in basically the same way as the SNAT target, but it does not require any --to-source option. The reason for this is that the MASQUERADE target was made to work with, for example, dial-up or DHCP connections, which get a dynamic IP address when connecting to the network in question. This means that you should only use the MASQUERADE target for connections with dynamically assigned IP addresses, i.e. where we do not know the actual address at all times. If you have a static IP connection, you should use the SNAT target instead.
The MASQUERADE target takes a little more overhead to compute than SNAT. The reason is that each time the MASQUERADE target gets hit by a packet, it automatically looks up the IP address to use, instead of doing what the SNAT target does, which is just using the single configured IP address. This is what makes the MASQUERADE target work properly with the dynamic IP addresses that your ISP might provide for your PPP, PPPoE or SLIP connections to the Internet.
Note that the MASQUERADE target is only valid within the POSTROUTING chain of the nat table, just like the SNAT target.
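
The difference between the two targets is easiest to see side by side. The following is a minimal sketch; the interface names and the address 203.0.113.5 (a documentation address) are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: SNAT for a static address, MASQUERADE for a dynamic one.
# Interface names and the address are illustrative assumptions.

nat_static() {
    # $1 = outgoing interface, $2 = fixed public address.
    # The source address is rewritten to the one configured IP.
    iptables -t nat -A POSTROUTING -o "$1" -j SNAT --to-source "$2"
}

nat_dynamic() {
    # $1 = outgoing interface with a dynamic address (DHCP, PPP, ...).
    # No --to-source here: the address is looked up from the interface.
    iptables -t nat -A POSTROUTING -o "$1" -j MASQUERADE
}

# Example calls (require root), left commented out on purpose:
# nat_static  eth0 203.0.113.5
# nat_dynamic ppp0
```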

DNAT

Logging

WRITEME

ULOG

Particular Use Cases

WRITEME

A default setup

A proper packet filter setup is one with a default deny policy, that is:

incoming connections are allowed only to local services and only from permitted machines.
outgoing connections are allowed only to services used by your system (DNS, web browsing, POP, email, ...).
the forward rule denies everything (unless you are protecting other systems, see below).
all other incoming or outgoing connections are denied.

For the best security, a packet filter should be applied before the internet-facing interface is brought up. If you have a dynamic IP and need to use it in your ruleset, consider loading a simple deny-all packet filter (remember to allow DHCP) before bringing up the interface, then switching to the real firewall after you get an IP.

increase the TTL of forwarded packets by 1 so the packet filter does not show up as a hop and stays stealthy
set the same TTL on all outgoing packets to hide the internal LAN structure
In the nat table, the PREROUTING chain is only traversed by the first packet of a connection, which means that all subsequent packets go totally unchecked in this chain. Filtering therefore does not belong there.
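
The default deny policy described in this section can be sketched as follows. This is a minimal sketch for a single host, not a gateway, and not the rule set used by any particular script: the choice of allowed services (DNS and web outbound, SSH inbound from one trusted address) is an assumption, and DHCP handling is left out.

```shell
#!/bin/sh
# Sketch of a default-deny policy for a single host (not a gateway).
# The allowed services are illustrative assumptions.

default_deny() {
    # $1 = address of the one machine allowed to reach our SSH service.

    # Drop everything that no rule explicitly allows.
    iptables -P INPUT   DROP
    iptables -P OUTPUT  DROP
    iptables -P FORWARD DROP

    # Loopback traffic and replies to connections we already allowed.
    iptables -A INPUT  -i lo -j ACCEPT
    iptables -A OUTPUT -o lo -j ACCEPT
    iptables -A INPUT  -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

    # Outgoing: DNS and web browsing only.
    iptables -A OUTPUT -p udp --dport 53 -m state --state NEW -j ACCEPT
    iptables -A OUTPUT -p tcp -m multiport --dports 53,80,443 \
        -m state --state NEW -j ACCEPT

    # Incoming: SSH from one trusted machine only.
    iptables -A INPUT -p tcp -s "$1" --dport 22 -m state --state NEW -j ACCEPT
}

# Example call (requires root), left commented out on purpose:
# default_deny 192.168.1.20
```

Because the FORWARD chain has no ACCEPT rule at all, the host forwards nothing, which matches the "forward rule denies everything" item in the list.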

Recent Module

WRITEME

OpenVZ

WRITEME

MARK Target

WRITEME

Apply Rules

It is time now to come up with a practical solution i.e. a rule set which we are going to feed to netfilter in order to do all kinds of fancy things like for example SNAT (Source Network Address Translation) or traffic shaping using the TOS (Type of Service) match and target.

Last but not least, 9 out of 10 people use netfilter in order to do packet filtering i.e. protect their infrastructure from malicious activities or erroneous software.

I have a Bash script (packet_filter) which does not blindly apply a set of netfilter/iptables rules for SNAT, TOS, packet filtering etc., but rather, which does so in a flexible and dynamic manner by looking at the system it is run on.

The rationale is simple. I wanted the whole task of securing a network or single machine with netfilter/iptables to be flexible enough so I can use it for my notebook, my workstation, which is a LAN machine, and also for servers which are either LAN machines or gateway machines ... this script can do it all.

Also, packet_filter has been made with OpenVZ in mind i.e. it allows securing VEs (Virtual Environments) from the HN (Hardware Node). packet_filter is run on the HN i.e. even in a virtualized environment, there is only one central security instance that concerns us with regards to packet filtering. This makes things easy to maintain and to adapt to certain needs that might arise.

However, even though the script can be used for OpenVZ environments, it also works perfectly fine for non-OpenVZ environments (i.e. the standard Debian box).

packet_filter makes use of generic.sh (another one of my shell scripts) for generic functions which are used in several of my shell scripts and not just packet_filter. We need to download generic.sh and put it in place as well to make things work.

One can find all of my scripts here — packet_filter and generic.sh are needed for firewalling. How to set things up is detailed further down.

Debugging / Testing

WRITEME

Start Packet Filter at System Boot

Once we have a packet filter in place and come up with a set of rules that we can load into it in order to protect our infrastructure, we want to have those protective measures in place automatically every time a machine boots.

What we need to do is to make it so that loading the rules into the packet filter happens automatically every time the machine boots. It might get rebooted by a manually issued reboot on the CLI (Command Line Interface), one may schedule reboots via a cron job, or maybe there is just bad luck and a power outage happens, in which case, at some point, the machine will boot up again too.

The point is, if we do not take care of the fact that by default, the packet filter allows traffic without any restrictions, the level of protection, otherwise possible with netfilter/iptables, equals zero.

/etc/init.d/script_name or /etc/network/interfaces

In the past, the number one choice for loading our rules automatically has been /etc/init.d/<script_name>. Because of its unnecessary complexity and security related issues, Debian switched to the now standard approach of using /etc/network/interfaces. The switch was made official with the release of Etch.

One major benefit of using /etc/network/interfaces is that we can avoid the following very easily and most effectively:

there is a time slot in which the interfaces are all up and functioning already but in which the packet filter is not yet fully functional
an attacker could now use this time slot to do harm, even take over the machine

Because we need/want to avoid this Achilles heel, what we are going to do is:

At first, we perform a lockdown of the machine in question — no single bit can enter or leave the machine.
Next, we bring up the interfaces. When they are all up and functional, the machine still remains in lockdown.
At the end of this whole sequence, we load our rule set into the packet filter (using our script packet_filter) by which the machine transitions from a lockdown state to a state where it is protected by the packet filter — our rule set is applied by the packet filter on any kind of traffic, forwarded, inbound and outbound traffic.

As can be seen, at no time will the machine be online without either being in lockdown mode or being protected by the packet filter i.e. netfilter/iptables.
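
The lockdown step of this sequence might look roughly like the following. This is a minimal sketch and an assumption about what a lockdown does; it is not the code of packet_filter, and it only touches the filter table.

```shell
#!/bin/sh
# Sketch of a lockdown step: nothing enters, leaves or crosses the machine.
# This is an assumption about what a lockdown does, not packet_filter's code.

lockdown() {
    # Set the DROP policies first so there is no moment in which
    # an emptied chain falls back to an ACCEPT policy.
    iptables -P INPUT   DROP
    iptables -P OUTPUT  DROP
    iptables -P FORWARD DROP
    # Then remove all rules and user-defined chains from the filter table.
    # With no rules left, the policies decide the fate of every packet.
    iptables -F
    iptables -X
}

# Run only when asked to (requires root), e.g. from a pre-up line.
if [ "${1:-}" = "lockdown" ]; then
    lockdown
fi
```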

Configuration

We already know about packet_filter and generic.sh. Now we need to put packet_filter and generic.sh in a place where they can be accessed so that packet_filter can do its job of testing for certain parameters as well as system settings and then load the appropriate rules.

Actually, we want to choose the path so that this can be done automatically but also manually i.e. without the need for some shell alias or the need to append something to PATH. Long story short, we put the scripts into /usr/local/bin.

sa@wks:/usr/local/bin$ type ll; ll
ll is aliased to `ls -lh'
total 40K
-rwxr-xr-x 1 sa sa 3.3K 2009-05-13 19:53 generic.sh
-rwxr-xr-x 1 sa sa  37K 2009-06-15 19:58 packet_filter
sa@wks:/usr/local/bin$

Preparing /etc/network/interfaces

Next we use the pre-up and post-up commands in /etc/network/interfaces in order to carry out our 3 step sequence from above i.e. lockdown the machine, bring its interfaces up and finally, load our rule set based on decisions made by packet_filter as it gathers information about the machine and its configuration and its intended purpose (notebook, LAN machine, gateway, etc.).

Because pre-up and post-up, as well as Bash, come with an environment variable PATH set to /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin, the location (/usr/local/bin) we choose works fine. Also, FHS (Filesystem Hierarchy Standard) tells us to put packet_filter and generic.sh into /usr/local/bin.

Once we are done editing /etc/network/interfaces, below is what it looks like for my workstation which we know is a LAN machine and gets its IP assigned via DHCP (Dynamic Host Configuration Protocol):

 1  sa@wks:~$ grep -A44 primary /etc/network/interfaces
 2  # The primary network interface
 3  #   post-up /usr/local/bin/packet_filter start needs to run two times
 4  #   so we can determine the gateway ip and hostname; the first run
 5  #   opens the firewall which allows the second run to determine the
 6  #   gateway hostname and ip and use it for further settings
 7
 8  auto eth0
 9  iface eth0 inet dhcp
10      pre-up   /usr/local/bin/packet_filter lockdown
11      post-up  /usr/local/bin/packet_filter start
12      post-up  /usr/local/bin/packet_filter start
13      pre-down /usr/local/bin/packet_filter save
14  sa@wks:~$

If we have several pre-up commands in one stanza, they are executed in order of appearance — the same is true for post-up, pre-down and post-down. Lines 8 and 9 were in place before already i.e. they are as set up during installing the system.

What we added are lines 10 to 13. As mentioned, those lines perform the sequence we outlined above i.e. like this we make sure there never is a time slot where our network/machine is unprotected.

Line 10 does the lockdown. Next the interfaces are brought up. Then line 11 kicks in, transitioning the firewall from total lockdown to protected state. The reason we issue the same line twice is simple: packet_filter contains a few commands that need access to the outside world e.g. when we determine the gateway IP and hostname.

What happens is simply that after line 11 the firewall already protects us but, in order to gather more information, we run it again i.e. after line 12, our network/machine is protected and packet_filter was able to gather information it could not find locally and thus had to search for it outside my workstation.

We are done! At this point our network or machine is protected by netfilter and it all happened totally automatically at system boot. Last but not least, when the machine is going for a reboot or is about to be shut down, we save the current rule set in line 13. That information might be important to have for investigating a bug or odd behavior, should it occur.

It is important to note that whatever script is used for packet filtering (packet_filter in our case), it needs to support the commands/parameters given to it above in lines 10 to 13. packet_filter does so of course, as it supports:

sa@wks:~$ grep -A6 '# Usage' /usr/local/bin/packet_filter
# Usage:             packet_filter start
#                    packet_filter restart
#                    packet_filter status
#                    packet_filter stop
#                    packet_filter save
#                    packet_filter panic|lockdown
### END INIT INFO
sa@wks:~$

Therefore, one can manually issue packet_filter status to get a status report on the current situation, what rules are active, how much traffic each rule saw, etc. packet_filter save saves the current rule set to the filesystem. packet_filter panic or packet_filter lockdown are synonymous, both do exactly the same, which is locking down the system.
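
For anyone using a different script, the command interface it has to offer can be sketched like this. This is a minimal sketch of the dispatch logic only: the do_* functions are hypothetical placeholders, and mapping restart to "lockdown, then start" is an assumption, not a description of how packet_filter works internally.

```shell
#!/bin/sh
# Sketch of the command interface a packet filter script has to offer so it
# can be driven from /etc/network/interfaces. The action bodies are
# hypothetical placeholders; only the dispatch logic is the point here.

do_start()    { echo "loading rule set"; }
do_stop()     { echo "opening packet filter"; }
do_status()   { echo "listing active rules"; }
do_save()     { echo "saving current rule set"; }
do_lockdown() { echo "locking down"; }

dispatch() {
    case "${1:-}" in
        start)          do_start ;;
        restart)        do_lockdown; do_start ;;  # assumed: never pass through "open"
        status)         do_status ;;
        stop)           do_stop ;;
        save)           do_save ;;
        panic|lockdown) do_lockdown ;;            # two names, one action
        *)
            echo "Usage: $0 {start|restart|status|stop|save|panic|lockdown}" >&2
            return 1
            ;;
    esac
}

# When called with an argument, act on it; otherwise only define functions.
if [ "$#" -gt 0 ]; then
    dispatch "$1"
fi
```

Returning a non-zero status for unknown commands matters here, because a failing pre-up command normally keeps the interface from being brought up at all.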

About the latter, we shall not issue packet_filter panic if we are logged into a machine remotely via SSH (Secure Shell) or we will have successfully locked ourselves out!

If we want to test it however, we can do so and either use iptables-apply or have a cron job in place which runs packet_filter stop (disable the firewall i.e. let traffic pass without restriction) every 5 minutes or so. When we are done testing, we remove the cron job again.
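
The cron based safety net mentioned above could be set up roughly like this. It is a minimal sketch: the file name under /etc/cron.d and the function name are assumptions for illustration, only the 5 minute interval and the packet_filter stop command come from the text.

```shell
#!/bin/sh
# Sketch: a temporary safety net while testing a lockdown over SSH.
# The cron.d file name used below is an illustrative assumption.

safety_net_entry() {
    # /etc/cron.d format: schedule, user, command. Every 5 minutes the
    # packet filter is opened again, so a lockout lasts 5 minutes at most.
    echo "*/5 * * * * root /usr/local/bin/packet_filter stop"
}

# Install before testing (as root):
#   safety_net_entry > /etc/cron.d/packet_filter_safety_net
# Remove when done testing:
#   rm /etc/cron.d/packet_filter_safety_net
```

iptables-apply follows the same idea without cron: it loads a rule file in iptables-save format and rolls back to the previous rules unless we confirm within a timeout that we can still reach the machine.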

Port Knocking

WRITEME

http://www.portknocking.org/view/implementations
what is spa http://en.wikipedia.org/wiki/Single_Packet_Authorization
spa vs. port knocking
- http://www.cipherdyne.org/fwknop/docs/SPA.html
- http://www.cipherdyne.org/fwknop/docs/faq.html#diffportknocking
- Single Packet Authorization offers many advantages over port knocking, including non-replayability of SPA packets, ability to use asymmetric ciphers (such as Elgamal), and SPA cannot be broken by simply spoofing packets to duplicate ports within the knock sequence on the server to break port knocking authentication.
- Authorization packets are either encrypted with the Rijndael block cipher or via GnuPG and associated asymmetric ciphers.

fwknop

how do I use GPG authentication for fwknopd in conjunction with monkeysphere i.e.
- in ~/.ssh/config, use the ssh-proxycommand for fwknop in order to wrap the fwknop/monkeysphere step into one simple ssh call i.e. not even a shell alias is needed for the first step (opening the sshd port via fwknop on the server)
- http://code.google.com/p/ssh-fwknop/

What types of services can be protected by fwknop? Technically, any service that can be filtered by a Netfilter policy is a candidate for protection by fwknop. Having said this, however, fwknop is most commonly used to provide an additional layer of security for services that typically have long running sessions such as OpenSSH or OpenVPN.
Any service protected by fwknop is inaccessible (by using iptables or ipfw to intercept packets within the kernel) before authenticating; anyone scanning for the service will not be able to detect that it is even listening.
Multiple users are supported by the fwknop server, and each user can be assigned their own symmetric or asymmetric encryption key via the /etc/fwknop/access.conf file.
For iptables firewalls, ACCEPT rules added by fwknop are added and deleted (after a configurable timeout) from custom iptables chains so that fwknop does not interfere with any existing iptables policy. The iptables rule additions are managed with the IPTables::ChainMgr module originally developed for the psad project.
Port randomization is supported for the destination port of SPA packets as well as the port over which the follow-on connection is made via the iptables NAT capabilities.
Supports the execution of shell commands on behalf of valid SPA packets.
The fwknop server can be configured to place multiple restrictions on inbound SPA packets beyond those enforced by encryption keys and replay attack detection. Namely, packet age, source IP address, remote user, access to requested ports, filtering regular expressions against commands, and more.

Prerequisites

up and running SSH (Secure Shell) setup

Install and Configure

sa@wks:~$ ssh website

        / \\      _-'
      _/   \\-''- _ /
 __-' {            \\
     /              \\
     /       "o.  |o }
     |            \\ ;            YOU ARE BEING WATCHED!
                   ',
        \\_         __\\
          ''-_    \\.//
            / '-____'
           /
         _'
       _-'


This computer system is the private property of its owner, whether individual, corporate or government. It is
for authorized use only. Users (authorized or unauthorized) have no explicit or implicit expectation of
privacy.

Any or all uses of this system and all files on this system may be intercepted, monitored, recorded, copied,
audited, inspected, and disclosed to your employer, to authorized site, government, and law enforcement
personnel, as well as authorized officials of government agencies, both domestic and foreign.

By using this system, the user consents to such interception, monitoring, recording, copying, auditing,
inspection, and disclosure at the discretion of such personnel or officials.


        UNAUTHORIZED OR IMPROPER USE OF THIS SYSTEM MAY RESULT
        IN CIVIL AND CRIMINAL PENALTIES AND ADMINISTRATIVE OR
        DISCIPLINARY ACTION, AS APPROPRIATE !!


By continuing to use this system you indicate your awareness of and consent to these terms and conditions of
use. LOG OFF IMMEDIATELY if you do not agree to the conditions stated in this warning. However, if you are
authorized personal with no bad intentions please continue. Have a nice day! :-)

sa@wks-ve10:~$ su
Password:
wks-ve10:/home/sa# netstat -tulpen
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode       PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      0          18892356    1690/apache2
tcp        0      0 0.0.0.0:18689           0.0.0.0:*               LISTEN      0          18882867    299/sshd
tcp6       0      0 :::18689                :::*                    LISTEN      0          18882869    299/sshd
wks-ve10:/home/sa# exit
exit
sa@wks-ve10:~$ exit
logout
Connection to 10.0.3.4 closed.
sa@wks:~$ nmap -p- 10.0.3.4

Starting Nmap 4.68 ( http://nmap.org ) at 2009-06-08 11:09 CEST
Interesting ports on 10.0.3.4:
Not shown: 65533 closed ports
PORT      STATE SERVICE
80/tcp    open  http
18689/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 1.410 seconds
sa@wks:~$

within VEs, we have to use venet0:0 instead of eth0 when configuring fwknopd
listening port for fwknop server UDP 62201
fwknop can be run in debug mode with the --debug command line option. This will disable daemon mode execution and print verbose information to the screen on STDERR as packets are received. Also, after issuing the first command, port 22 should be open on the server. I would use nmap to scan the server specifically for port 22 to see if the port is open.
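
The nmap check for a single port can be wrapped into a small helper. This is a minimal sketch; the function name is made up for illustration, and it relies only on the nmap output format shown in the transcript above (one line per port, e.g. "80/tcp open http").

```shell
#!/bin/sh
# Sketch: check whether a single TCP port shows up as open in an nmap scan.
# Host and port are supplied by the caller.

port_is_open() {
    # $1 = host, $2 = TCP port. nmap prints one line per scanned port,
    # e.g. "22/tcp open  ssh"; we succeed only if the state is "open".
    nmap -p "$2" "$1" | grep -q "^$2/tcp[[:space:]][[:space:]]*open"
}

# Example, left commented out on purpose:
# port_is_open 10.0.3.4 22 && echo "port 22 is reachable"
```

Run once before and once after sending the authorization packet, the helper should fail the first time and succeed the second time if fwknop works as intended.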

Files

Take a look at /etc/fwknop/fwknop.conf. The config files are

/etc/fwknop/access.conf
/etc/fwknop/fwknop.conf
/etc/fwknop/pf.os

GPG (GNU Privacy Guard)

http://cipherdyne.org/fwknop/docs/gpghowto.html

Port Randomization

http://www.cipherdyne.org/blog/2008/06/single-packet-authorization-with-port-randomization.html
use either -a or -R switch in case we connect from a machine with non-routable (i.e. private) IP address

Testing

http://www.cipherdyne.org/fwknop/docs/test_suite.html

Miscellaneous

fwknop Daemons

knopmd, knoptm, knopwatchd: we consider those helpers to fwknopd

fwknop and Tor

see man 8 fwknop_serv

Authentication

WRITEME

http://en.wikipedia.org/wiki/NuFW
- http://www.nufw.org/Debian-packages.html
- http://www.nufw.org/docs/howto20.html

DoS, DDoS

WRITEME

build something Python-based e.g. use fabric to manage netfilter/iptables and enable/disable rules based on system monitoring (the usual OS-level stuff e.g. HTTP GETs, bandwidth, diskspace, etc.) and higher-level metrics such as number/type of DNS queries, number of credit card requests/declines, number of user authentications, etc.
http://en.wikipedia.org/wiki/Denial-of-service_attack

OSI Layer 4

TCP
- http://en.wikipedia.org/wiki/SYN_%28TCP%29#Connection_establishment

OSI Layer 7

HTTP
- GET, POST

Slowloris

Pro-active Approaches

WRITEME

fail2ban

psad

PSAD is a collection of four lightweight system daemons, written in Perl and in C, that is designed to work with Linux firewalling code in order to detect port scans and act appropriately i.e. change firewalling rules on the fly, thus adapting the system to the current security threat.

fwsnort

Miscellaneous

WRITEME

xtables addons

http://netfilter.org/documentation/HOWTO/netfilter-extensions-HOWTO-4.html
http://xtables-addons.sourceforge.net/
- http://dev.medozas.de/files/xtables/xtables-addons.8.html
- no recompiling of either kernel or iptables necessary; see /usr/share/doc/xtables-addons-source/README.Debian

Application Layer

GUIs

http://en.wikipedia.org/wiki/Comparison_of_firewalls

fwbuilder

fwbuilder

Saving / Restoring Rulesets

Instead of including all of the iptables rules in the /etc/init.d/<name_of_shell_script_containing_ruleset> script, we can use the iptables-restore program to restore the rules saved using iptables-save.
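
The save and restore round trip can be sketched as two small functions. This is a minimal sketch; the function names are made up for illustration and the file path is whatever the caller passes in.

```shell
#!/bin/sh
# Sketch: dump the active rule set to a file and load it back later.
# The file path is passed in by the caller.

save_rules() {
    # $1 = target file. iptables-save writes all tables to stdout
    # in a format that iptables-restore understands.
    iptables-save > "$1"
}

restore_rules() {
    # $1 = file written by save_rules. iptables-restore reads it from
    # stdin and loads it in one go instead of one iptables call per rule.
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
    iptables-restore < "$1"
}

# Example (requires root), left commented out on purpose:
# save_rules    /etc/default/firewall
# restore_rules /etc/default/firewall
```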

In order to do this we need to setup our rules, save the ruleset under a static location (such as /etc/default/firewall http://iptables-tutorial.frozentux.net/iptables-tutorial.html#SPEEDCONSIDERATIONS