Tracing a packet through the Linux networking subsystem
16 Mar 2015

In this post we trace the path of a packet through the Linux networking subsystem, giving a brief overview from the packet's arrival at the network card to its final destination on a socket's receive queue.
Networking drivers and their woes
As an example we take a simple networking driver that is widely used for this purpose: the RTL8139 ethernet driver. It is a fairly complicated driver for a PCI ethernet card which I don't pretend to fully understand; I just document some key parts here. Most of the source code can be found in the kernel source tree.
When the driver module is brought up it registers itself with the PCI subsystem in the kernel using the standard call pci_register_driver, defining the standard interface methods, as shown below.
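The registration looks roughly like this; the field values are paraphrased from 8139too.c, so treat it as a sketch rather than the exact source:

    static struct pci_driver rtl8139_pci_driver = {
        .name     = DRV_NAME,            /* "8139too" */
        .id_table = rtl8139_pci_tbl,     /* PCI vendor/device IDs this driver claims */
        .probe    = rtl8139_init_one,    /* called when a matching card is found */
        .remove   = rtl8139_remove_one,
    };

    module_pci_driver(rtl8139_pci_driver);  /* boils down to pci_register_driver() at module load */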
When the board is initialized via the probe function, the method rtl8139_init_one is called. It confirms that the board that was plugged in belongs to the expected vendor. The board gets initialized and we map the memory regions of the PCI device to an ioaddr, using the BAR register of the PCI device, so that we can perform memory-mapped I/O with it. We also initialize the key kernel structure used to describe the network device, a huge data structure called struct net_device.
The netdev data structure also contains dev->netdev_ops, which defines the device operations.
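A trimmed sketch of what those operations look like for this driver (the actual field list in 8139too.c is longer):

    static const struct net_device_ops rtl8139_netdev_ops = {
        .ndo_open       = rtl8139_open,        /* bring the interface up */
        .ndo_stop       = rtl8139_close,       /* bring it down */
        .ndo_start_xmit = rtl8139_start_xmit,  /* transmit one skb */
        /* ... */
    };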
In the function rtl8139_open, as a key step we register a handler to respond to interrupts from the hardware, as follows. The IRQ to use is obtained from the PCI configuration of the device. We also allocate two buffers which will be mapped to the transmit and receive buffers on the device.
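A simplified sketch of the relevant pieces of rtl8139_open; error handling and several other setup steps are elided, and the buffer-size macros are the driver's own:

    static int rtl8139_open(struct net_device *dev)
    {
        struct rtl8139_private *tp = netdev_priv(dev);
        const int irq = tp->pci_dev->irq;   /* IRQ assigned via the PCI configuration */
        int retval;

        /* our handler runs whenever the card raises this (shared) interrupt */
        retval = request_irq(irq, rtl8139_interrupt, IRQF_SHARED, dev->name, dev);
        if (retval)
            return retval;

        /* DMA-able buffers backing the card's transmit and receive rings */
        tp->tx_bufs = dma_alloc_coherent(&tp->pci_dev->dev, TX_BUF_TOT_LEN,
                                         &tp->tx_bufs_dma, GFP_KERNEL);
        tp->rx_ring = dma_alloc_coherent(&tp->pci_dev->dev, RX_BUF_TOT_LEN,
                                         &tp->rx_ring_dma, GFP_KERNEL);
        /* ... */
        return 0;
    }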
To deal with highly interrupting devices, Linux has moved to a newer API called NAPI, which can dynamically switch a device between polling and interrupt mode based on certain policy considerations.
Finally we perform certain device-specific initializations in rtl8139_hw_start, like enabling interrupts on the device, setting receive modes and other device-specific miscellany.
Having thus set up the device, we allow Linux to start using it to send packets by calling the key method netif_start_queue.
There is also a watchdog timer which I am punting on for now.
Now, on packet receipt, the device raises an interrupt, invoking our handler. If we can schedule the NAPI poll we do so, as shown here:
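The typical NAPI scheduling pattern from an interrupt handler looks something like this; it is a paraphrase of the idea, not the verbatim driver code:

    /* inside rtl8139_interrupt(), once receive work is pending */
    if (napi_schedule_prep(&tp->napi)) {
        /* the real driver also masks the card's interrupts at this point */
        __napi_schedule(&tp->napi);   /* raises NET_RX_SOFTIRQ; our poll routine runs later */
    }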
The received packets are then processed by NAPI, which calls the poll routine we registered, passing in a fixed budget that decides how much receive work to do now and how much to defer. The actual method doing the receive work is rtl8139_rx:
If all is well we allocate an skb, the key kernel data structure used to hold packets as they are received and processed up the protocol stack. We copy the packet from the device receive buffer into the skb, update some device statistics, detect the link-layer protocol used by the packet and finally call the key method netif_receive_skb with the copied packet.
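A heavily simplified sketch of the per-packet work in rtl8139_rx; the ring-buffer offsets and surrounding loop are approximated:

    /* for each frame sitting in the receive ring ... */
    skb = netdev_alloc_skb_ip_align(dev, pkt_size);
    if (likely(skb)) {
        /* copy the frame out of the DMA ring into the freshly allocated skb */
        skb_copy_to_linear_data(skb, &rx_ring[ring_offset + 4], pkt_size);
        skb_put(skb, pkt_size);

        skb->protocol = eth_type_trans(skb, dev);  /* detect the link-layer protocol */
        netif_receive_skb(skb);                    /* hand the packet up the stack */

        dev->stats.rx_bytes += pkt_size;
        dev->stats.rx_packets++;
    } else {
        dev->stats.rx_dropped++;
    }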
As of this reading it is unclear to me whether the copy happens in the context of the actual interrupt or in the context of the softirq generated by the NAPI subsystem.
Either way, netif_receive_skb takes place. The skb is now going to get queued onto a per-CPU packet backlog queue, kept in the softnet_data structure, using the function enqueue_to_backlog.
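A simplified sketch of enqueue_to_backlog from net/core/dev.c; the real function also deals with flow limits and with kicking the backlog NAPI instance:

    struct softnet_data *sd = &per_cpu(softnet_data, cpu);

    if (skb_queue_len(&sd->input_pkt_queue) <= netdev_max_backlog) {
        __skb_queue_tail(&sd->input_pkt_queue, skb);  /* park the skb on this CPU's backlog */
        /* NET_RX_SOFTIRQ is scheduled if the backlog poll is not already pending */
        return NET_RX_SUCCESS;
    }
    return NET_RX_DROP;   /* backlog full: the packet is dropped */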
A packet arrives
After successfully queuing the packet onto the CPU backlog queue we return NET_RX_SUCCESS to the driver. Now we move away from the driver side of packet receipt to the operating-system side of processing the packet. I still need to look into how the process_backlog function gets invoked.
Anyway, our process_backlog function gets called, at which point we dequeue the skb from the per-CPU queue.
It is now the job of __netif_receive_skb to process the skb. After some munging of the skb, __netif_receive_skb_core gets called, which in turn calls the function deliver_skb.
A key step here is to determine the packet type that we are dealing with. Different protocols register their packet types, allowing themselves to be identified. The key packet-type interface is described as follows:
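The interface lives in include/linux/netdevice.h and looks roughly like this, with several fields trimmed:

    struct packet_type {
        __be16              type;   /* link-layer protocol id, e.g. cpu_to_be16(ETH_P_IP) */
        struct net_device   *dev;   /* NULL means "match any device" */
        int                 (*func)(struct sk_buff *, struct net_device *,
                                    struct packet_type *, struct net_device *);
        void                *af_packet_priv;
        struct list_head    list;
        /* ... */
    };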
We can see the packet type for IPv4 in net/ipv4/af_inet.c, as shown here:
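The IPv4 packet type, which inet_init registers with dev_add_pack, is roughly:

    static struct packet_type ip_packet_type __read_mostly = {
        .type = cpu_to_be16(ETH_P_IP),
        .func = ip_rcv,              /* entry point into the IP layer */
    };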
Thus our deliver_skb function is going to match the type of the packet as IP and call ip_rcv.
In the ip_rcv function we parse the IP header out of the skb, determine the length of the packet and update some statistics, finally ending with the mysterious netfilter hook, which is generally used to customize the handling of packets if we so choose, as shown here:
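The tail of ip_rcv looks roughly like this in the 3.x kernels:

    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);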
The key function provided to the netfilter hook is ip_rcv_finish, which is called if netfilter allows processing of the packet to continue.
A packet begins its ascent
ip_rcv_finish may need to look into the packet and check whether it needs to be routed to another machine. I am only going to look at the case where the packet is destined for the current machine.
The IP layer consults the routing table and a routing cache to find out where the packet is meant to be delivered. If the packet is to be delivered to the local host, it ends up with a struct dst_entry whose input method is set to ip_local_deliver.
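ip_rcv_finish then hands the skb to that input method via dst_input (include/net/dst.h):

    static inline int dst_input(struct sk_buff *skb)
    {
        /* for packets destined to us this ends up calling ip_local_deliver() */
        return skb_dst(skb)->input(skb);
    }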
When ip_local_deliver gets called we encounter another netfilter hook, NF_INET_LOCAL_IN, which is invoked as follows:
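Roughly, the tail of ip_local_deliver is:

    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);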
Thus we see that we can add a netfilter hook just for packets meant for the local host. Assuming again that netfilter allows further processing of the packet, we are ready to continue.
Inside ip_local_deliver_finish we are now ready to examine the IP protocol to which the packet ought to be delivered. (There is something about raw delivery which needs to be looked at, but I am currently skipping it.)
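The core of ip_local_deliver_finish, simplified, is the following lookup and dispatch:

    int protocol = ip_hdr(skb)->protocol;        /* e.g. IPPROTO_UDP */
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot) {
        int ret = ipprot->handler(skb);          /* udp_rcv, tcp_v4_rcv, icmp_rcv, ... */
        /* a negative return asks for the packet to be resubmitted */
    }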
Notice how we look up the protocol number in the IP header and then use it to index the inet_protos array for the implementing protocol, finally calling its handler. These protocol handlers are registered during inet subsystem initialization with a call to inet_init.
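In inet_init (net/ipv4/af_inet.c) we find, roughly:

    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);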
There we see the initialization of the protocol array with some common IP protocols. The protocols themselves are described as follows:
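For example, the UDP entry (trimmed from af_inet.c):

    static const struct net_protocol udp_protocol = {
        .handler     = udp_rcv,     /* called from ip_local_deliver_finish */
        .err_handler = udp_err,     /* ICMP errors destined for UDP */
        .no_policy   = 1,
        .netns_ok    = 1,
    };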
We see each protocol defining its corresponding handlers. We, however, are only going to look at the UDP handler, to keep it relatively simple.
Home Sweet Socket
UDP, much like TCP, contains a hash table of sockets that are currently listening for packets. The details of this are left to the reader to delve into. Assuming the packet was a UDP packet, its protocol handler must have been initialized to udp_rcv, as shown above.
Now we look up the socket and do a simple checksum. The lookup takes into account the source and destination addresses and the source and destination ports, as we can see from the arguments to the lookup method:
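The lookup helper's prototype makes those inputs explicit; this is its shape in net/ipv4/udp.c around this kernel version, and later kernels add more parameters:

    struct sock *__udp4_lib_lookup(struct net *net,
                                   __be32 saddr, __be16 sport,
                                   __be32 daddr, __be16 dport,
                                   int dif, struct udp_table *udptable);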
If there is a socket listening we ought to find it, finally calling udp_queue_rcv_skb with the found socket and the skb. This ultimately translates into a call to sock_queue_rcv_skb. In case we are using some sort of socket filtering, which I believe is similar to the Berkeley Packet Filter, we pass the socket and the skb to that socket filter; the underlying method for this is sk_filter.
We call skb_set_owner_r to set the found socket as the skb's owner, and are now ready to queue the skb onto the socket's receive queue, thus finally reaching the underlying socket.
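Put together, the tail end of this path has roughly the following shape, simplified from sock_queue_rcv_skb and friends (the real code also checks the receive-buffer limits):

    int err = sk_filter(sk, skb);           /* run any attached (BPF-style) socket filter */
    if (err)
        return err;

    skb_set_owner_r(skb, sk);               /* charge the skb against the socket's rcvbuf */
    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk);                  /* wake up anyone blocked on this socket */
    return 0;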
Oh Packet, I waited for you so long.
When the inet subsystem gets initialized, apart from initializing all sorts of caches and adding various IP protocols to the inet_protos array, we also initialize the socket subsystem with a call to sock_register, which registers a handler for creating sockets of this protocol family.
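For IPv4 the registered family handler is, roughly:

    static const struct net_proto_family inet_family_ops = {
        .family = PF_INET,
        .create = inet_create,   /* invoked from __sock_create() for AF_INET sockets */
        .owner  = THIS_MODULE,
    };

    /* in inet_init(): */
    (void)sock_register(&inet_family_ops);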
We might recognize PF_INET as the protocol family that is used during socket creation. If we remember our socket programming, one of the first steps in creating a socket is the socket system call, which can be seen in socket.c and which threads down into a call to __sock_create with all the usual arguments.
The protocol family that is passed in is an integer indexing the net_families array. This is the very protocol family which we registered at inet_init. Thus the family's create method, invoked from the socket system call, results in a call to inet_create.
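The dispatch in __sock_create, simplified, is:

    const struct net_proto_family *pf;

    pf = rcu_dereference(net_families[family]);   /* inet_family_ops for PF_INET */
    /* ... refcount and capability checks elided ... */
    err = pf->create(net, sock, protocol, kern);  /* inet_create() in our case */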
In af_inet.c we actually see the definitions of the various protocols of the IP family.
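The datagram entry of inetsw_array in af_inet.c looks roughly like this:

    {
        .type     = SOCK_DGRAM,
        .protocol = IPPROTO_UDP,
        .prot     = &udp_prot,         /* protocol-specific ops (struct proto) */
        .ops      = &inet_dgram_ops,   /* generic socket ops for datagram sockets */
        .flags    = INET_PROTOSW_PERMANENT,
    },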
Each protocol maps its operations onto the common socket operations. Consider for example SOCK_DGRAM, which uses the UDP protocol. As we traverse through inet_create we find that the struct socket which gets created also gets assigned a struct sock. If we remember, the struct sock was the structure on whose sk_receive_queue the final packet ended up; here we are creating the empty queue onto which our sent and received packets will get placed. I still need to look at why struct socket is used as an encapsulation layer over struct sock. Anyway, moving on to our next method, i.e. bind(): if we remember, the bind system call is defined in the kernel as follows:
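From net/socket.c, roughly (security and audit hooks trimmed):

    SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
    {
        struct socket *sock;
        struct sockaddr_storage address;
        int err, fput_needed;

        sock = sockfd_lookup_light(fd, &err, &fput_needed);  /* fd -> struct socket */
        if (sock) {
            err = move_addr_to_kernel(umyaddr, addrlen, &address);  /* copy in the sockaddr */
            if (err >= 0)
                err = sock->ops->bind(sock, (struct sockaddr *)&address, addrlen);
            fput_light(sock->file, fput_needed);
        }
        return err;
    }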
Ok, since the fd passed to bind is a regular file descriptor, the first thing we must do is convert it to a struct socket. To do this we look at the current process's list of files, just as we would for a regular file. This file descriptor entry was actually added when we created the socket using the socket fs. The struct socket for this file descriptor is tucked away in the file's private data, as seen here:
We can see the private data being handed back as a struct socket *. Finally, having found the socket we are referring to, we end up calling the bind of the underlying socket via sock->ops->bind, as seen in the system call body above.
Ah, but then one might ask: what does the ops of our AF_INET socket, created with SOCK_DGRAM, point to? I am just going to guess that the ops is inet_dgram_ops. Thus perhaps it will be helpful to look at inet_dgram_ops.
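A trimmed version of inet_dgram_ops from af_inet.c:

    const struct proto_ops inet_dgram_ops = {
        .family  = PF_INET,
        .owner   = THIS_MODULE,
        .release = inet_release,
        .bind    = inet_bind,          /* the generic bind, shared with TCP */
        .connect = inet_dgram_connect,
        .poll    = udp_poll,
        .sendmsg = inet_sendmsg,
        .recvmsg = inet_recvmsg,
        /* ... */
    };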
We see a mapping of the bind method to the generic inet_bind, which is used by both TCP and UDP. Inside inet_bind we get the underlying struct sock and, using the struct sockaddr *uaddr, set it up with the relevant information which will be used later.
Are you listening to the words coming out of my mouth.
While it seems that the fd can be used by any sort of file-descriptor reader, I don't know much about that. Instead I shall look into the recvfrom method, an example usage of which could be:
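A minimal, hypothetical user-space sketch of a UDP receiver; the port number and buffer size are arbitrary and error handling is kept short:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* ends up in inet_create() */
        struct sockaddr_in addr = { 0 };
        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(12345);       /* arbitrary example port */
        addr.sin_addr.s_addr = htonl(INADDR_ANY);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }

        char buf[2048];
        struct sockaddr_in src;
        socklen_t srclen = sizeof(src);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&src, &srclen);
        if (n >= 0)
            printf("got %zd bytes\n", n);
        return 0;
    }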
Now switching over to recvmsg, we see that we can receive messages by calling inet_recvmsg.
This in the end just calls the UDP protocol's recvmsg. We can see it define all sorts of methods that are common to protocols running over IP.
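A trimmed look at struct proto udp_prot in net/ipv4/udp.c:

    struct proto udp_prot = {
        .name     = "UDP",
        .owner    = THIS_MODULE,
        .close    = udp_lib_close,
        .connect  = ip4_datagram_connect,
        .sendmsg  = udp_sendmsg,
        .recvmsg  = udp_recvmsg,    /* what inet_recvmsg ends up calling */
        .hash     = udp_lib_hash,
        .unhash   = udp_lib_unhash,
        .get_port = udp_v4_get_port,
        /* ... */
    };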
Thus the call finally trickles down to udp_recvmsg.
The call here blocks, depending on the options, and waits for an skb in any case. Of course the receive path for datagrams is endlessly flexible in ways we are currently not interested in, but for now we see a loop which waits for and assembles packets to be served to the user, dealing with various timeouts as necessary.
wait_for_more_packets optionally creates a wait-queue entry on which it can wait until a packet arrives, enqueueing the task onto sk->sk_wq (I think) and periodically waking itself up to check the queue (I think).
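A simplified sketch of that waiting logic in net/core/datagram.c; details around peeking, pending errors and shutdown are omitted:

    DEFINE_WAIT_FUNC(wait, receiver_wake_function);

    prepare_to_wait_exclusive(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);

    /* sleep until data shows up on sk->sk_receive_queue or the timeout expires;
     * the sk_data_ready() call on the enqueue side is what wakes us up */
    if (skb_queue_empty(&sk->sk_receive_queue))
        *timeo_p = schedule_timeout(*timeo_p);

    finish_wait(sk_sleep(sk), &wait);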
On waking up we walk through the socket's sk_receive_queue, picking up the first skb and returning it.
Now that we have gotten the skb from the network, we need to copy it into the msg for the user to consume. This happens in skb_copy_and_csum_datagram_msg, which is passed the message header; the data is copied through a struct iov_iter *to, an iterator over the msg, chunking it if the message is too big.
skb_copy_datagram_msg copies as much of the data as is required by the underlying application, using the appropriate __copy_to_user method.
Thus finally handing the data to user space.
Summary
The primary reference text, which contains a lot of the gory details, is the extremely detailed Linux Networking Internals, which reads kind of like a bible. For kernel details there is the equally detailed Understanding the Linux Kernel. And then there is always a good Google search, which invariably lands on an LWN article. There is a nice tutorial on PCI cards on TLDP, and for a saner introduction to listening for UDP datagrams see the UDP server example. A lot of this material was discussed in a Linux kernel class I am taking at UCSC, whose reference site probably contains more accurate information. Clearly the most amazing thing about all this is that this logic really does get executed, many times over, all across the internet and on localhost, for every packet. I wrote this mostly as a brain dump of reading through the 3.19-rc7 kernel networking source, so it may be highly unreliable and inaccurate; users beware! As always, any comments or suggestions for improvements are welcome.