Tracing a packet through the Linux networking subsystem.

In this post we trace the path of a packet through the Linux networking subsystem, giving a brief overview from the packet's arrival at the network card to its final destination on a socket's receive queue.

Networking drivers and their woes

As an example we take a simple networking driver that is widely used in tutorials: the RTL8139 ethernet driver. It is a fairly complicated driver for an ethernet PCI card which I don't pretend to fully understand; the goal here is just to document some key parts. Most of the source code can be found in

<linux-src>/drivers/net/ethernet/realtek/8139too.c

When the driver module is brought up it registers itself with the PCI subsystem in the kernel using the standard call pci_register_driver, defining the standard interface methods as shown below.

static struct pci_driver rtl8139_pci_driver = {
	.name		= DRV_NAME,
	.id_table	= rtl8139_pci_tbl,
	.probe		= rtl8139_init_one,
	.remove		= rtl8139_remove_one,
#ifdef CONFIG_PM
	.suspend	= rtl8139_suspend,
	.resume		= rtl8139_resume,
#endif /* CONFIG_PM */
};

When the board is initialized via the probe function, the method rtl8139_init_one is called. This method confirms that the board plugged in belongs to the expected vendor. The board gets initialized and we map the memory regions on the PCI device to an ioaddr.

   	ioaddr = pci_iomap(pdev, bar, 0);

This uses the BAR register of the PCI device, so that we can perform memory-mapped I/O with it. We also initialize the key kernel structure used to describe the network device, a huge data structure called struct net_device.
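
Roughly, the probe function ties these pieces together as in the following sketch (heavily simplified; the real rtl8139_init_one does much more error handling and hardware setup, and the mmio_addr field name is taken from the driver's private struct):

	struct net_device *dev;
	struct rtl8139_private *tp;
	int rc;

	dev = alloc_etherdev(sizeof(*tp));	/* net_device plus driver-private area */
	if (!dev)
		return -ENOMEM;

	tp = netdev_priv(dev);
	tp->mmio_addr = ioaddr;			/* the pci_iomap()ed register window */
	dev->netdev_ops = &rtl8139_netdev_ops;	/* device operations, shown below */

	rc = register_netdev(dev);		/* make the device visible to the stack */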

The netdev data structure also contains a dev->netdev_ops field defining the device operations.

static const struct net_device_ops rtl8139_netdev_ops = {
	.ndo_open		= rtl8139_open,
	.ndo_stop		= rtl8139_close,
	.ndo_get_stats64	= rtl8139_get_stats64,
	.ndo_change_mtu		= rtl8139_change_mtu,
	.ndo_validate_addr	= eth_validate_addr,
	.ndo_set_mac_address 	= rtl8139_set_mac_address,
	.ndo_start_xmit		= rtl8139_start_xmit,
	.ndo_set_rx_mode	= rtl8139_set_rx_mode,
	.ndo_do_ioctl		= netdev_ioctl,
	.ndo_tx_timeout		= rtl8139_tx_timeout,
#ifdef CONFIG_NET_POLL_CONTROLLER
	.ndo_poll_controller	= rtl8139_poll_controller,
#endif
	.ndo_set_features	= rtl8139_set_features,
};

In the function rtl8139_open, as a key step, we register a handler to respond to interrupts from the hardware, done as follows:

   	retval = request_irq(irq, rtl8139_interrupt, IRQF_SHARED, dev->name, dev);

The irq to use is obtained from the PCI configuration of the device. We also allocate two buffers which will be mapped to the transmit and receive buffers on the device:

	struct rtl8139_private *tp = netdev_priv(dev);
        ....
         ....

	tp->tx_bufs = dma_alloc_coherent(&tp->pci_dev->dev, TX_BUF_TOT_LEN,
					   &tp->tx_bufs_dma, GFP_KERNEL);
	tp->rx_ring = dma_alloc_coherent(&tp->pci_dev->dev, RX_BUF_TOT_LEN,
					   &tp->rx_ring_dma, GFP_KERNEL);

To deal with highly interrupting devices, Linux has been moving to a newer API called NAPI, which can dynamically switch a device from interrupt-driven mode to polling based on certain policy considerations.
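
In this driver the NAPI context is registered during probe with netif_napi_add, naming rtl8139_poll as the poll routine, and enabled before traffic starts flowing; a rough sketch (the weight of 64 is the usual default, and the exact call sites may differ slightly):

	netif_napi_add(dev, &tp->napi, rtl8139_poll, 64);	/* in rtl8139_init_one */
	...
	napi_enable(&tp->napi);					/* in rtl8139_open */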

Finally we perform certain device-specific initializations in rtl8139_hw_start, like enabling interrupts on the device, setting receive modes, and other device-specific miscellany.

Having thus set up the device, we allow Linux to start using it to send packets by calling the key method

static inline void netif_start_queue(struct net_device *dev)

netif_start_queue (dev);

There is also a watchdog timer which I am punting on for now.

Now, on packet receipt, the device raises an interrupt, invoking our handler:

static irqreturn_t rtl8139_interrupt (int irq, void *dev_instance)

If we can schedule the NAPI poll we do it, as shown here:

	if (status & RxAckBits){
		if (napi_schedule_prep(&tp->napi)) {
			RTL_W16_F (IntrMask, rtl8139_norx_intr_mask);
			__napi_schedule(&tp->napi);
		}
	}

The received packets are thus processed by NAPI, which will call the specified poll routine:

static int rtl8139_poll(struct napi_struct *napi, int budget)

It is passed a fixed budget, which decides how much receive work to perform now and how much to defer.
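
The canonical shape of such a poll routine is roughly the sketch below: do at most budget worth of receive work, and if we finish early, report completion so NAPI can drop back to interrupt mode. This is the simplified pattern rather than the driver's exact code; the tp->dev field name is assumed from the driver's private struct, and the interrupt re-enabling is device specific.

static int example_poll(struct napi_struct *napi, int budget)
{
	struct rtl8139_private *tp = container_of(napi, struct rtl8139_private, napi);
	int work_done;

	/* receive at most budget packets from the rx ring */
	work_done = rtl8139_rx(tp->dev, tp, budget);

	if (work_done < budget) {
		/* ran out of packets before running out of budget: leave polling mode */
		napi_complete(napi);
		/* ... re-enable the device's receive interrupts here ... */
	}
	return work_done;
}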

The actual method doing the receipt is:

static int rtl8139_rx(struct net_device *dev, struct rtl8139_private *tp,
		      int budget)

If all is well we allocate an skb, the key kernel data structure that holds packets as they are received and processed up the protocol stack. We copy the packet from the device receive buffer into the skb, update some device statistics, detect the link-layer protocol used by the packet, and finally call the key method netif_receive_skb with the copied packet.
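
A rough sketch of that per-packet work inside rtl8139_rx might look like the following (the allocation helper, the ring_offset arithmetic, and the statistics handling are simplified assumptions rather than the driver's exact code):

	/* the 4-byte hardware status header precedes the frame in the rx ring */
	skb = netdev_alloc_skb_ip_align(dev, pkt_size);
	if (likely(skb)) {
		skb_copy_to_linear_data(skb, &rx_ring[ring_offset + 4], pkt_size);
		skb_put(skb, pkt_size);				/* the copied frame is now the skb data */
		skb->protocol = eth_type_trans(skb, dev);	/* detect the link-layer protocol */
		/* ... update the device's rx packet and byte counters ... */
		/* ... and finally hand the skb to the stack, as shown next ... */
	}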

   netif_receive_skb (skb);

As of reading this, it's unclear to me whether the copy happens in the context of the actual interrupt or in the context of the softirq generated by the NAPI subsystem.

Either way, netif_receive_skb will take place. The skb is now going to get queued onto a per-cpu packet backlog queue (part of softnet_data), using the function

static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
			      unsigned int *qtail)

....
	__skb_queue_tail(&sd->input_pkt_queue, skb);
.....
	return NET_RX_SUCCESS;
...

A packet arrives

After successfully queuing the packet onto the cpu backlog queue we return NET_RX_SUCCESS to the driver. Now we move away from the driver side of packet receipt to the operating system side of processing the packet. I still need to look into exactly how process_backlog gets invoked.
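
A hint lies in how the per-cpu softnet_data is wired up at boot: its embedded backlog napi_struct has process_backlog as its poll routine, so the NET_RX softirq drains the backlog exactly as it would poll any other NAPI device. A trimmed sketch from net_dev_init in net/core/dev.c:

	int i;

	for_each_possible_cpu(i) {
		struct softnet_data *sd = &per_cpu(softnet_data, i);

		skb_queue_head_init(&sd->input_pkt_queue);
		skb_queue_head_init(&sd->process_queue);
		...
		sd->backlog.poll = process_backlog;	/* the backlog is just another NAPI poller */
		sd->backlog.weight = weight_p;
	}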

   static int process_backlog(struct napi_struct *napi, int quota)

Anyway, our process function gets called, at which point we dequeue skbs from the per-cpu queue:

   static int process_backlog(struct napi_struct *napi, int quota)
   .....
		while ((skb = __skb_dequeue(&sd->process_queue))) {
			local_irq_enable();
			__netif_receive_skb(skb);
			local_irq_disable();
			input_queue_head_incr(sd);
			if (++work >= quota) {
				local_irq_enable();
				return work;
			}
		}

 ......

It is now the job of __netif_receive_skb to process the skb. After some munging of the skb, __netif_receive_skb_core gets called, which in turn calls the function deliver_skb.

A key step here is to determine the packet type that we are dealing with. Different protocols register their packet types, allowing themselves to become identifiable. The packet type interface is described as follows:

struct packet_type {
	__be16			type;	/* This is really htons(ether_type). */
	struct net_device	*dev;	/* NULL is wildcarded here	     */
	int			(*func) (struct sk_buff *,
					 struct net_device *,
					 struct packet_type *,
					 struct net_device *);
	bool			(*id_match)(struct packet_type *ptype,
					    struct sock *sk);
	void			*af_packet_priv;
	struct list_head	list;
};

We can see the packet type for IPv4 in net/ipv4/af_inet.c, as shown here:

#define ETH_P_IP	0x0800		/* Internet Protocol packet	*/
....
static struct packet_type ip_packet_type __read_mostly = {
	.type = cpu_to_be16(ETH_P_IP),
	.func = ip_rcv,
};
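
For deliver_skb to find this handler, the packet type has to be registered with the core networking code; inet_init does that with a call to dev_add_pack:

	dev_add_pack(&ip_packet_type);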

Thus our deliver_skb function will match the type of the packet as IP and call ip_rcv.

int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)

ip_rcv lives in net/ipv4/ip_input.c.

In the ip_rcv function we parse the IP header out of the skb, determine the length of the packet, and update some statistics, finally ending with the mysterious Netfilter hook which is generally used to customize actions on packets if we so choose. As shown here:

   int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
   ....
	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
		       ip_rcv_finish);
   .....                       

The key function provided to the netfilter hook is ip_rcv_finish, which is called if netfilter allows processing of the packet to continue.

static int ip_rcv_finish(struct sk_buff *skb) {
....
}
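
As an aside, this hook point is where an iptables PREROUTING rule or a custom kernel module attaches. A hedged sketch of registering a hook at NF_INET_PRE_ROUTING follows; note that the hook function's prototype has changed across kernel versions (this is roughly the 3.x form), and my_prerouting_hook / my_hook_ops are made-up names:

static unsigned int my_prerouting_hook(const struct nf_hook_ops *ops,
				       struct sk_buff *skb,
				       const struct net_device *in,
				       const struct net_device *out,
				       int (*okfn)(struct sk_buff *))
{
	/* inspect or mangle the skb here; returning NF_DROP would discard it */
	return NF_ACCEPT;
}

static struct nf_hook_ops my_hook_ops = {
	.hook     = my_prerouting_hook,
	.pf       = NFPROTO_IPV4,
	.hooknum  = NF_INET_PRE_ROUTING,
	.priority = NF_IP_PRI_FIRST,
};

/* registered from module init with nf_register_hook(&my_hook_ops) */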

A packet begins its ascent

The ip_rcv_finish function may need to look into the packet and check whether it needs to be routed to other machines. I am only going to look at the case where the packet is destined for the current machine.

The IP layer consults the routing table and a routing cache to find out where the packet is meant to be delivered.

Finally, if the packet is to be delivered to the local host, the lookup returns a struct dst_entry with its input method set to ip_local_deliver.
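
Stripped of the route cache, early demux, and error handling, the relevant part of ip_rcv_finish is roughly the following sketch:

	const struct iphdr *iph = ip_hdr(skb);

	if (!skb_dst(skb)) {
		int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
					       iph->tos, skb->dev);
		if (unlikely(err))
			goto drop;
	}
	...
	return dst_input(skb);	/* for local traffic, dst->input is ip_local_deliver */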

When ip_local_deliver gets called we encounter another netfilter hook, NF_INET_LOCAL_IN, which is invoked as follows:

int ip_local_deliver(struct sk_buff *skb)
{
....
  return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
         ip_local_deliver_finish);
}

Thus we see we can also add a netfilter hook just for packets meant for the local host. Assuming again that netfilter allows the packet through, we are ready to continue processing it.

Inside ip_local_deliver_finish we are now ready to examine the IP protocol to which the packet ought to be delivered. There is something about raw delivery which needs to be looked at, but it is skipped for now.

static int ip_local_deliver_finish(struct sk_buff *skb)
{
        ......
        int protocol = ip_hdr(skb)->protocol;
        ....
	ipprot = rcu_dereference(inet_protos[protocol]);
        ...
   	ret = ipprot->handler(skb);
        ......
}

Notice how we look up the protocol in the IP header and then use it to index the inet_protos array for the implementing protocol, finally calling its handler. These protocol handlers are registered during inet subsystem initialization, in inet_init:

static int __init inet_init(void)
{
        ....
       	if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
		pr_crit("%s: Cannot add ICMP protocol\n", __func__);
	if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
		pr_crit("%s: Cannot add UDP protocol\n", __func__);
	if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
		pr_crit("%s: Cannot add TCP protocol\n", __func__);
#ifdef CONFIG_IP_MULTICAST
	if (inet_add_protocol(&igmp_protocol, IPPROTO_IGMP) < 0)
		pr_crit("%s: Cannot add IGMP protocol\n", __func__);
#endif
        ....
}

Here we see the initialization of the protocol array with some common IP protocols. The protocols themselves are described as follows:

static const struct net_protocol tcp_protocol = {
	.early_demux	=	tcp_v4_early_demux,
	.handler	=	tcp_v4_rcv,
	.err_handler	=	tcp_v4_err,
	.no_policy	=	1,
	.netns_ok	=	1,
	.icmp_strict_tag_validation = 1,
};

static const struct net_protocol udp_protocol = {
	.early_demux =	udp_v4_early_demux,
	.handler =	udp_rcv,
	.err_handler =	udp_err,
	.no_policy =	1,
	.netns_ok =	1,
};

static const struct net_protocol icmp_protocol = {
	.handler =	icmp_rcv,
	.err_handler =	icmp_err,
	.no_policy =	1,
	.netns_ok =	1,
};

We see each protocol defining its corresponding handlers. To keep things relatively simple, we are only going to look at the UDP handler.

Home Sweet Socket

UDP, much like TCP, maintains a hash table of sockets that are currently listening for packets.

/**
 *	struct udp_table - UDP table
 *
 *	@hash:	hash table, sockets are hashed on (local port)
 *	@hash2:	hash table, sockets are hashed on (local port, local address)
 *	@mask:	number of slots in hash tables, minus 1
 *	@log:	log2(number of slots in hash table)
 */
struct udp_table {
	struct udp_hslot	*hash;
	struct udp_hslot	*hash2;
	unsigned int		mask;
	unsigned int		log;
};
extern struct udp_table udp_table;

The details of this are left to the reader to delve into. Assuming that the packet was a UDP packet, its protocol field must have been set to IPPROTO_UDP:

  IPPROTO_UDP = 17,		/* User Datagram Protocol		*/
#define IPPROTO_UDP		IPPROTO_UDP

Now we look up the socket and do a simple checksum. The lookup takes into account the source and destination addresses as well as the source and destination ports, as we can see from the arguments to the lookup method:

int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
		   int proto)
{
        ....
	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);
        ....
}

struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
		__be16 sport, __be32 daddr, __be16 dport,
		int dif, struct udp_table *udptable)
                

If there is a socket listening we ought to find it, finally calling udp_queue_rcv_skb with the found socket and the skb:

   	ret = udp_queue_rcv_skb(sk, skb);

This eventually translates into a call to sock_queue_rcv_skb. In case we are using some sort of socket filtering, which is the kernel's Berkeley Packet Filter mechanism, we pass the socket and the skb to that socket filter. The underlying method for this is sk_filter.

   int sk_filter(struct sock *sk, struct sk_buff *skb)
   ...
   int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
   {
   ...
   	err = sk_filter(sk, skb);
   ...
   }
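
From user space this is the classic BPF filter one attaches with setsockopt and SO_ATTACH_FILTER; a minimal accept-everything sketch (sockfd is assumed to be an already-created socket descriptor):

// Example usage
#include <stdio.h>
#include <linux/filter.h>
#include <sys/socket.h>

struct sock_filter code[] = {
	/* unconditionally return "keep the whole packet" */
	BPF_STMT(BPF_RET | BPF_K, 0xFFFFFFFF),
};
struct sock_fprog prog = {
	.len	= sizeof(code) / sizeof(code[0]),
	.filter	= code,
};

if (setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0)
	perror("setsockopt(SO_ATTACH_FILTER)");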

We call skb_set_owner_r to set the found socket as the skb's owner, and we are now ready to queue this skb onto the socket's receive queue.

int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
        ....
        struct sk_buff_head *list = &sk->sk_receive_queue;
        ....
       	__skb_queue_tail(list, skb);
        ....
}

The packet has thus reached the underlying socket.

Oh Packet, I waited for you so long.

When the inet subsystem gets initialized, apart from setting up all sorts of caches and adding various IP protocols to the inet_protos array, we also initialize the socket subsystem with a call to sock_register, which registers a handler for the protocol family.

   	(void)sock_register(&inet_family_ops);


static const struct net_proto_family inet_family_ops = {
	.family = PF_INET,
	.create = inet_create,
	.owner	= THIS_MODULE,
};

We might recognize PF_INET as the protocol family used during socket creation. If we remember our socket programming, one of the first steps in the creation of a socket is the socket system call, which can be seen in socket.c and which threads down to a call to __sock_create with all the usual arguments.

SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
int __sock_create(struct net *net, int family, int type, int protocol,
			 struct socket **res, int kern)
{
...
	sock = sock_alloc();
        ....
       	pf = rcu_dereference(net_families[family]);

        ...
       	err = pf->create(net, sock, protocol, kern);
}

The protocol family that is passed in is an integer indexing the net_families array. This is the very protocol family which we registered at inet_init. Thus the pf->create method ends up calling inet_create from the socket system call.
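
In user-space terms, this whole path is what the familiar creation call triggers:

// Example usage
int sockfd = socket(AF_INET, SOCK_DGRAM, 0);	/* family PF_INET/AF_INET, type SOCK_DGRAM, protocol 0 -> UDP */
if (sockfd < 0)
    perror("socket");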

In af_inet.c we actually see the definitions of the various protocols of the IP family:

static struct inet_protosw inetsw_array[] =
{
	{
		.type =       SOCK_STREAM,
		.protocol =   IPPROTO_TCP,
		.prot =       &tcp_prot,
		.ops =        &inet_stream_ops,
		.flags =      INET_PROTOSW_PERMANENT |
			      INET_PROTOSW_ICSK,
	},
	{
		.type =       SOCK_DGRAM,
		.protocol =   IPPROTO_UDP,
		.prot =       &udp_prot,
		.ops =        &inet_dgram_ops,
		.flags =      INET_PROTOSW_PERMANENT,
       },
       {
		.type =       SOCK_DGRAM,
		.protocol =   IPPROTO_ICMP,
		.prot =       &ping_prot,
		.ops =        &inet_dgram_ops,
		.flags =      INET_PROTOSW_REUSE,
       },
       {
	       .type =       SOCK_RAW,
	       .protocol =   IPPROTO_IP,	/* wild card */
	       .prot =       &raw_prot,
	       .ops =        &inet_sockraw_ops,
	       .flags =      INET_PROTOSW_REUSE,
       }
};

Each protocol maps its type to the common socket operations. Consider for example SOCK_DGRAM, which uses udp_prot. As we traverse through inet_create we find that the struct socket which gets created also gets assigned a struct sock. If we remember, struct sock was the structure onto whose sk_receive_queue the final packet ended up. Here we are creating the empty queue onto which our sent and received packets will get placed. I still need to look at why struct socket is used as an encapsulation layer over struct sock.

Anyway, moving on to our next method, bind(): if we remember, the bind system call is defined in the kernel as follows.

SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)

// Example usage

   
   serv_addr.sin_family = AF_INET;
   serv_addr.sin_addr.s_addr = INADDR_ANY;
   serv_addr.sin_port = htons(portno);
   
   /* Now bind the host address using bind() call.*/
   if (bind(sockfd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0)

Ok, since the fd passed to bind is a regular file descriptor, the first thing we must do is convert it to a struct socket. To do this we look at the current process's list of files, just as we would for a regular file. This file descriptor entry was actually added when we created the socket using sockfs. The struct socket for this file descriptor is tucked away in the file's private data, as seen here:

struct socket *sock_from_file(struct file *file, int *err)
{
	if (file->f_op == &socket_file_ops)
		return file->private_data;	/* set in sock_map_fd */

	*err = -ENOTSOCK;
	return NULL;
}

We can see private_data being returned as the struct socket *. Finally, having found the socket we are referring to, we end up calling the bind of the underlying socket, as shown here:

SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
{
...

				err = sock->ops->bind(sock,
						      (struct sockaddr *)
						      &address, addrlen);
...
}

Ah, but then one might ask: what does the ops of our AF_INET socket, which we created with SOCK_DGRAM, point to? From the inetsw_array above, for SOCK_DGRAM the ops is inet_dgram_ops, so it will be helpful to look at it.

const struct proto_ops inet_dgram_ops = {
	.family		   = PF_INET,
	.owner		   = THIS_MODULE,
	.release	   = inet_release,
	.bind		   = inet_bind,
	.connect	   = inet_dgram_connect,
	.socketpair	   = sock_no_socketpair,
	.accept		   = sock_no_accept,
	.getname	   = inet_getname,
	.poll		   = udp_poll,
	.ioctl		   = inet_ioctl,
	.listen		   = sock_no_listen,
	.shutdown	   = inet_shutdown,
	.setsockopt	   = sock_common_setsockopt,
	.getsockopt	   = sock_common_getsockopt,
	.sendmsg	   = inet_sendmsg,
	.recvmsg	   = inet_recvmsg,
	.mmap		   = sock_no_mmap,
	.sendpage	   = inet_sendpage,
#ifdef CONFIG_COMPAT
	.compat_setsockopt = compat_sock_common_setsockopt,
	.compat_getsockopt = compat_sock_common_getsockopt,
	.compat_ioctl	   = inet_compat_ioctl,
#endif
};
EXPORT_SYMBOL(inet_dgram_ops);

We see a mapping of the bind method to the generic inet_bind, which is used by both TCP and UDP. Inside inet_bind we get the underlying struct sock and, using the struct sockaddr *uaddr, set it up with the relevant addressing information which will be used later.

int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
{
 .....
      
      	struct inet_sock *inet = inet_sk(sk);
 .....
  inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
  ...
  inet->inet_sport = htons(inet->inet_num);
}

Are you listening to the words coming out of my mouth.

While it seems that the fd could be read with an ordinary file-descriptor read, I don't know much about that path. Instead I shall look into the recvfrom method, an example usage of which could be:

// Example Usage
char buffer[549];
struct sockaddr_storage src_addr;
socklen_t src_addr_len=sizeof(src_addr);
ssize_t count=recvfrom(fd,buffer,sizeof(buffer),0,(struct sockaddr*)&src_addr,&src_addr_len);
if (count==-1) {
    die("%s",strerror(errno));
} else if (count==sizeof(buffer)) {
    warn("datagram too large for buffer: truncated");
} else {
    handle_datagram(buffer,count);
}

// System call in the kernel
SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
		unsigned int, flags, struct sockaddr __user *, addr,
		int __user *, addr_len)
{
....
       	err = sock_recvmsg(sock, &msg, size, flags);
....
}

// Calling underlying socket recvmsg
static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
				       struct msghdr *msg, size_t size, int flags)
{
	struct sock_iocb *si = kiocb_to_siocb(iocb);

	si->sock = sock;
	si->scm = NULL;
	si->msg = msg;
	si->size = size;
	si->flags = flags;

	return sock->ops->recvmsg(iocb, sock, msg, size, flags);
}

Now switching over to recvmsg, we see that for our socket this means calling inet_recvmsg.

int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
		 size_t size, int flags)
{
....
       	err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
			   flags & ~MSG_DONTWAIT, &addr_len);
....
}

This in the end just calls the UDP protocol's recvmsg. We can see udp_prot define all sorts of methods that are common to protocols running over IP:

// udp.c
struct proto udp_prot = {
	.name		   = "UDP",
	.owner		   = THIS_MODULE,
	.close		   = udp_lib_close,
	.connect	   = ip4_datagram_connect,
	.disconnect	   = udp_disconnect,
	.ioctl		   = udp_ioctl,
	.destroy	   = udp_destroy_sock,
	.setsockopt	   = udp_setsockopt,
	.getsockopt	   = udp_getsockopt,
	.sendmsg	   = udp_sendmsg,
	.recvmsg	   = udp_recvmsg,
	.sendpage	   = udp_sendpage,
	.backlog_rcv	   = __udp_queue_rcv_skb,
	.release_cb	   = ip4_datagram_release_cb,
	.hash		   = udp_lib_hash,
	.unhash		   = udp_lib_unhash,
	.rehash		   = udp_v4_rehash,
	.get_port	   = udp_v4_get_port,
	.memory_allocated  = &udp_memory_allocated,
	.sysctl_mem	   = sysctl_udp_mem,
	.sysctl_wmem	   = &sysctl_udp_wmem_min,
	.sysctl_rmem	   = &sysctl_udp_rmem_min,
	.obj_size	   = sizeof(struct udp_sock),
	.slab_flags	   = SLAB_DESTROY_BY_RCU,
	.h.udp_table	   = &udp_table,
#ifdef CONFIG_COMPAT
	.compat_setsockopt = compat_udp_setsockopt,
	.compat_getsockopt = compat_udp_getsockopt,
#endif
	.clear_sk	   = sk_prot_clear_portaddr_nulls,
};
EXPORT_SYMBOL(udp_prot);

Thus finally the call trickles down to udp_recvmsg.

int udp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		size_t len, int noblock, int flags, int *addr_len)
{
    .....
    	skb = __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
				  &peeked, &off, &err);
    .....
}

The call we make here blocks depending on the options and waits for an skb in any case. Of course the datagram receive path is endlessly flexible in ways we are currently not interested in, but for now we see a loop which will wait for packets to be ready to be served to the user, handling various timeout issues as necessary.

struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
				    int *peeked, int *off, int *err)
{
...
	do {
        ...
        ..
        } while (!wait_for_more_packets(sk, err, &timeo, last));
....
}

Here wait_for_more_packets optionally creates a wait queue entry on which it can wait until a packet arrives:

static int wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
				 const struct sk_buff *skb)
{

	DEFINE_WAIT_FUNC(wait, receiver_wake_function);
       	prepare_to_wait_exclusive(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
        ...
        ..
       	*timeo_p = schedule_timeout(*timeo_p);
}

This enqueues the task on sk->sk_wq (I think), sleeping until it is woken up or the timeout expires and then checking the queue (I think).
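
The wake-up itself comes from the enqueue side: sock_queue_rcv_skb, after queuing the skb, calls sk->sk_data_ready(sk), which for an ordinary socket was set to sock_def_readable in sock_init_data and wakes any sleepers on the socket's wait queue. A trimmed sketch:

// net/core/sock.c -- trimmed
static void sock_def_readable(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	if (wq_has_sleeper(wq))
		wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
						POLLRDNORM | POLLRDBAND);
	...
	rcu_read_unlock();
}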

On waking up we walk through the socket's sk_receive_queue, pick up the first skb, and return it:

struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
				    int *peeked, int *off, int *err)
{
        struct sk_buff_head *queue = &sk->sk_receive_queue;
         ....
 	skb_queue_walk(queue, skb) {
         ....
         __skb_unlink(skb, queue);
         ....
         return skb;
        }
}

Now that we have gotten the skb from the network, we need to copy it into the msg for the user to consume. This happens in skb_copy_and_csum_datagram_msg, which is passed the message header. If the message is too big it may need to be copied in chunks through the struct iov_iter iterator embedded in the msg.

int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
				   int hlen, struct msghdr *msg)
{
...
	if (iov_iter_count(&msg->msg_iter) < chunk) {
		if (__skb_checksum_complete(skb))
			goto csum_error;
		if (skb_copy_datagram_msg(skb, hlen, msg, chunk))
			goto fault;
	}
        ....
}

Here skb_copy_datagram_msg copies as much of the data as is required by the underlying application, using the appropriate __copy_to_user method:

__copy_to_user(v.iov_base, (from += v.iov_len) - v.iov_len,
			       v.iov_len),

Thus finally handing the data to user space.

Summary

The primary reference text which contains a lot of the gory details is the extremely detailed Understanding Linux Network Internals, which reads kind of like a bible. For kernel details there is the equally detailed Understanding the Linux Kernel. And then there is always a good Google search, which invariably lands on an LWN article. There is a nice tutorial on PCI cards on tldp, and for a saner introduction to listening for UDP datagrams see UDP server. A lot of this material was discussed in a Linux kernel class I am taking at UCSC, whose reference site probably contains more accurate information. Clearly the most amazing thing about all this is that this logic really does get executed, many times over, all over the internet and on localhost, for every packet. I am writing this mostly as a brain dump of reading through the source of the 3.19-rc7 kernel networking subsystem, so it may be highly unreliable and inaccurate. Users beware! As always, any comments or suggestions for improvements are welcome.

