Linux TCP/IP Stack Study Notes (4): Linux Socket (Part II)

Packet, Raw, Netlink, and Routing Sockets :
 
Netlink, routing, packet, and raw sockets are all specialized socket types.
Netlink provides a socket-based interface for exchanging messages and settings between user space and the kernel's internal protocols.
 
Rtnetlink is used for application-level management of the neighbor tables and IP routing tables.
 
Packet sockets are accessed by an application when it sets AF_PACKET in the family field of the socket call:

ps = socket(PF_PACKET, int type, int protocol);

Type is set to either SOCK_RAW or SOCK_DGRAM. Protocol carries the number of the protocol
and is the same as the IP header protocol number, or one of the other valid protocol numbers.
 
 
Raw sockets allow user-level application code to receive and transmit network layer packets by intercepting them before they pass through the transport layer.
 
rs = socket(PF_INET, SOCK_RAW, int protocol);
 
Protocol is set to the protocol number that the application wants to transmit or receive. A common example of the use of raw sockets is the ping command. When the ping application opens its socket, it sets the protocol field in the socket call to IPPROTO_ICMP.

Ping and other application programs for route and network maintenance use a Linux utility library call, getprotobyname(3), to convert a protocol name into a protocol number.

 
 
Netlink sockets are accessed by calling socket with family set to AF_NETLINK.
 
ns = socket(AF_NETLINK, int type, int netlink_family);
 
The type parameter can be set to either SOCK_DGRAM or SOCK_STREAM, but it doesn't really matter, because the protocol accessed is determined by netlink_family, which is set to one of the values in Table 5.6. The send and recv socket calls are generally used with netlink sockets.
 
 
Implementation of the Socket API System Calls :
 
There are several steps involved in directing each application-layer socket call to the specific protocol that must respond to the request.
 
In other words, a socket can be associated with many protocol families; once a particular protocol is selected, the generic calls must be bound to the corresponding functions, which takes several steps.
 
First, any address referenced in the call’s arguments must be mapped from user space to kernel space. 
Next, the functions themselves must be translated from generic socket layer functions to the specific functions for the protocol family. 
Finally, the functions must be translated from the protocol family generic functions to the specific functions for the member protocol in the family.
 
Once we have a pointer to the socket structure, we retrieve the functions specific to the address family and protocol type of the open socket. To do this, we call the protocol-specific function through a pointer in the structure pointed to by the ops field of the socket structure.
 
 
asmlinkage long sys_socketcall(int call, unsigned long __user *args);
 
The first thing it does is map each address from user space to kernel space, by calling copy_from_user. Next, sys_socketcall invokes the kernel function that corresponds to the user-level socket call. For example, when the user calls bind, sys_socketcall maps the user-level bind to the kernel function sys_bind, and listen is mapped to sys_listen.
 
Sys_sendmsg and sys_recvmsg have a bit more work to do than the other socket functions. They must first verify that the iovec buffer array contains valid addresses. Each address is mapped between user and kernel space later, when the data is actually transferred, but the addresses are validated now. After the iovec structure has been validated, sock_sendmsg or sock_recvmsg is called, respectively.
Sys_accept is a bit more complicated because it has to establish a new socket for the new incoming connection. The first thing it does is call sock_alloc to allocate a new socket. Next, it gets a name for the socket by calling the function pointed to by the getname field of the ops field in the socket structure. Remember that the "name" of a socket is the address and port number associated with the socket. Next, it calls sock_map_fd to map the new socket into the pseudo socket filesystem.
 
 
The functions sock_read and sock_write set up an iovec-type msghdr structure before calling sock_recvmsg and sock_sendmsg, respectively.
 
Sock_setsockopt and sock_getsockopt are called from the system call if level is set to SOL_SOCKET. The purpose of these functions is to set values in the sock structure according to the options that were passed as parameters by the application layer.
 
Sock_setsockopt gets a pointer to the sock structure from the sk field of the socket structure, sock, which was passed as an argument. Next, it sets options in the sock structure, sk, based on the values pointed to by the optname and optval arguments. Refer to Section 5.3.1 (see the separate Socket API notes) for a description of the fields in the sock structure. If SO_DEBUG is set in optname, debug is set; reuse is set to the value of SO_REUSEADDR; localroute to the value of SO_DONTROUTE; no_check to the value of SO_NO_CHECK; and priority to the value of SO_PRIORITY.
 
Sock_getsockopt reverses what sock_setsockopt does. It retrieves certain values from the sock structure for the option socket and returns them to the user.
 
 
How does each member protocol communicate with the socket layer?
 
The file descriptor fd is used to map each socket API call to a function specific to each protocol.
 
In addition, as we saw in Section 5.5.2, each of the protocols registers itself with the protocol switch table. When the socket structure is initialized, as described in Section 5.4, the ops field is set to the set of protocol-specific operations from the entry in the protocol switch table.
 
Once all the complex initialization is done as described in other sections, the actual mapping is quite simple. In most cases, the “sys_” versions of the socket functions simply call sockfd_lookup to get a pointer to the socket structure and call the protocol’s function through the ops field.
 
A simple example: this function is called in the kernel when the user executes the getsockname socket API function to get the address (name) of a socket.
 
As a quick refresher, here is how the system call reaches its kernel implementation:
 
getsockname() -> sys_getsockname() -> SYSCALL_DEFINE3(getsockname, …) (linux/net/socket.c)
 
/*
 *    Get the local address ('name') of a socket object. Move the obtained
 *    name to user space.
 */

SYSCALL_DEFINE3(getsockname, int, fd, struct sockaddr __user *, usockaddr,
                int __user *, usockaddr_len)
{
       struct socket *sock;
       struct sockaddr_storage address;
       int len, err, fput_needed;

       sock = sockfd_lookup_light(fd, &err, &fput_needed);
       if (!sock)
             goto out;

       err = security_socket_getsockname(sock);
       if (err)
             goto out_put;

       err = sock->ops->getname(sock, (struct sockaddr *)&address, &len, 0);
       if (err)
             goto out_put;
       err = move_addr_to_user((struct sockaddr *)&address, len,
                               usockaddr, usockaddr_len);

out_put:
       fput_light(sock->file, fput_needed);
out:
       return err;
}
 
 
Creation of a Socket :
 
Sock_create, defined in file linux/net/socket.c, is called from sys_socket. This function initiates the creation of a new socket:
int sock_create(int family, int type, int protocol, struct socket **res);
 
First, sock_create verifies that family is one of the allowed family types shown in Table 5.1. Then, it allocates a socket by calling sock_alloc, which returns a new socket structure, sock. Sock_alloc, called from sock_create, returns an allocated socket structure. The socket structure is actually part of an inode structure, created when sock_alloc calls new_inode.
Once the inode is created, sock_alloc retrieves the socket structure from the inode. Then, it initializes a few fields in the socket structure.
 
Sockets maintain state reflecting whether an open socket represents a connection to a peer or not.
 
The socket state really only reflects whether there is an active connection.
 
After returning from the sock_alloc call, sock_create calls the create function for the protocol family. It accesses the array net_families to get the family’s create function. For TCP/IP, family will be set to AF_INET. 
 
The create function for AF_INET is inet_create, defined in file af_inet.c:
static int inet_create(struct socket *sock, int protocol);
 
In inet_create, we create a new sock structure called sk and initialize a few more fields. We call sk_alloc to allocate the sock structure from the slab cache that is specific to the protocol for this socket (inet_sk_slab):

sk = sk_alloc(PF_INET, GFP_KERNEL, inet_sk_size(protocol),
              inet_sk_slab(protocol));
 
After creation:
Sk points into the new slab cache. Next, inet_create searches the protocol switch table for an entry matching the protocol. After getting the result of the search, the capability flags are checked against the capabilities of the current process; if the caller doesn't have permission to create this type of socket, the user-level socket call returns the EPERM error.
 
Once the sock structure has been created, inet_create initializes it as follows:
 
inet_create sets some fields in the new sock data structure; however, many fields are pre-initialized when allocation is done from the slab cache. The field sk_family is set to PF_INET. The prot field is set to the protocol's protocol block structure, which defines the specific functions for each of the transport protocols. No_check and ops are set according to their respective values in the protocol switch table. If the type of the socket is SOCK_RAW, the num field is set to the protocol number. As will be shown in later chapters, this field is used by IP to route packets internally depending on whether a raw socket is open. The sk_destruct field of sk is set to inet_sock_destruct, the sock structure destructor. The sk_backlog_rcv field is set to point to the protocol-specific backlog receive function. Next, some fields in the protocol family-specific part of the sock structure are initialized.
 
This macro obtains the inet field of the inet_sock structure:
#define inet_sk(__sk) (&((struct inet_sock *)__sk)->inet) 
 
For what follows, see the separate analyses of the inet_sock structure and the proto structure.
 
The sk_prot field in the sock structure points to the protocol block structure. The init field in the proto structure is specific for each protocol and socket type within the AF_INET protocol family. 
 
 
Netlink and Rtnetlink :
 
Netlink is an internal communication protocol. It mainly exists to transmit and receive messages between the application layer and various protocols in the Linux kernel. Netlink is implemented as a protocol with its own address family, AF_NETLINK, and it supports most of the socket API functions. Rtnetlink is a set of message extensions to the basic netlink protocol messages. The most common use of netlink is for applications to exchange routing information with the kernel's internal routing table.
 
 
Netlink sockets are accessed like any other sockets. Both socket calls and system I/O calls work with netlink sockets. For example, the sendmsg and recvmsg calls are generally used by user-level applications to add and delete routes. Both of these calls pass a pointer to an nlmsghdr structure in the msg argument.
 
 
struct nlmsghdr {
       __u32       nlmsg_len;    /* Length of message including header */
       __u16       nlmsg_type;   /* Message content */
       __u16       nlmsg_flags;  /* Additional flags */
       __u32       nlmsg_seq;    /* Sequence number */
       __u32       nlmsg_pid;    /* Sending process ID */
};
 
The netlink protocol is implemented in the file net/netlink/af_netlink.c.
It is similar to UDP or TCP in that it defines a proto_ops structure to bind internal calls with socket calls made through the AF_NETLINK address family sockets.
 
static const struct proto_ops netlink_ops = {
      .family =      PF_NETLINK,
      .owner =       THIS_MODULE,
      .release =     netlink_release,
      .bind =        netlink_bind,
      .connect =     netlink_connect,
      .socketpair =  sock_no_socketpair,
      .accept =      sock_no_accept,
      .getname =     netlink_getname,
      .poll =        datagram_poll,
      .ioctl =       sock_no_ioctl,
      .listen =      sock_no_listen,
      .shutdown =    sock_no_shutdown,
      .setsockopt =  netlink_setsockopt,
      .getsockopt =  netlink_getsockopt,
/* Sendmsg and recvmsg are the main functions used to send and receive
 * messages through AF_NETLINK sockets. */
      .sendmsg =     netlink_sendmsg,
      .recvmsg =     netlink_recvmsg,
      .mmap =        sock_no_mmap,
      .sendpage =    sock_no_sendpage,
};
 
Just like other protocols, such as UDP and TCP, that register with the socket layer, the netlink address family declares a global instance of the net_proto_family structure in the file af_netlink.c:

struct net_proto_family netlink_family_ops = {
    .family = PF_NETLINK,
    .create = netlink_create,
    .owner =  THIS_MODULE,
};
 
The netlink module also provides an initialization function for the protocol, netlink_proto_init:
 
static int __init netlink_proto_init( void );
 
This function registers the netlink family operations with the socket layer by calling sock_register.

Linux TCP/IP Stack Study Notes (3): Linux Socket (Part I)

Chapter 5: Linux Sockets 

Sockets provide a standard protocol-independent interface between the application-level programs and the TCP/IP stack. 
 
From the viewpoint of TCP/IP, everything above the transport layer is part of the application.
 
The socket API is the best known networking interface for Unix application and network programming. 
 
 
One definition of the socket interface is that it is the interface between the transport-layer protocols in the TCP/IP stack and all the protocols above.

The socket interface is also the interface between the kernel and the application layer for all network programming functions.

In fact, the socket interface is the only way that applications make use of the TCP/IP suite of protocols.
 
Sockets have three fundamental purposes. They are used to transfer data, manage connections for TCP, and control or tune the operation of the TCP/IP stack. 
 
Once the socket is open, generic I/O calls such as read and write can be used to move data through it.
 
The structures sock, inet_sock, and sock_common are covered in dedicated notes; here is an overall introduction.
 
struct sock – network layer representation of sockets
struct sock_common – minimal network layer representation of sockets
 
struct inet_sock – representation of INET sockets. When a sock structure instance is allocated from the slab, the inet_sock follows the sock structure; it contains the protocol-specific information for IPv4 and IPv6.

 
The socket structure is the general structure that holds control and state information for the socket layer.
 
/**
 *  struct socket – general BSD socket
 *  @state: socket state (%SS_CONNECTED, etc)
 *  @type: socket type (%SOCK_STREAM, etc)
 *  @flags: socket flags (%SOCK_ASYNC_NOSPACE, etc)
 *  @ops: protocol specific socket operations
 *  @fasync_list: Asynchronous wake up list
 *  @file: File back pointer for gc
 *  @sk: internal networking protocol agnostic socket representation
 *  @wait: wait queue for several uses
 */
struct socket {
       socket_state            state;

       kmemcheck_bitfield_begin(type);
       short                   type;
       kmemcheck_bitfield_end(type);

       unsigned long           flags;
       /*
        * Please keep fasync_list & wait fields in the same cache line
        */
       struct fasync_struct    *fasync_list;
       wait_queue_head_t       wait;

       struct file             *file;
       struct sock             *sk;
       /* Ops points to the protocol-specific operations for the socket */
       const struct proto_ops  *ops;
};
 
 
The proto_ops structure contains the family type for this particular set of socket operations. For IPv4, it is set to AF_INET.
 
struct proto_ops {
       int          family;
       struct module *owner;
       int          (*release)   (struct socket *sock);
       int          (*bind)      (struct socket *sock,
                                  struct sockaddr *myaddr,
                                  int sockaddr_len);
       int          (*connect)   (struct socket *sock,
                                  struct sockaddr *vaddr,
                                  int sockaddr_len, int flags);
       int          (*socketpair)(struct socket *sock1,
                                  struct socket *sock2);
       int          (*accept)    (struct socket *sock,
                                  struct socket *newsock, int flags);
       int          (*getname)   (struct socket *sock,
                                  struct sockaddr *addr,
                                  int *sockaddr_len, int peer);
       unsigned int (*poll)      (struct file *file, struct socket *sock,
                                  struct poll_table_struct *wait);
       int          (*ioctl)     (struct socket *sock, unsigned int cmd,
                                  unsigned long arg);
#ifdef CONFIG_COMPAT
       int          (*compat_ioctl)(struct socket *sock, unsigned int cmd,
                                  unsigned long arg);
#endif
       int          (*listen)    (struct socket *sock, int len);
       int          (*shutdown)  (struct socket *sock, int flags);
       int          (*setsockopt)(struct socket *sock, int level,
                                  int optname, char __user *optval,
                                  unsigned int optlen);
       int          (*getsockopt)(struct socket *sock, int level,
                                  int optname, char __user *optval,
                                  int __user *optlen);
#ifdef CONFIG_COMPAT
       int          (*compat_setsockopt)(struct socket *sock, int level,
                                  int optname, char __user *optval,
                                  unsigned int optlen);
       int          (*compat_getsockopt)(struct socket *sock, int level,
                                  int optname, char __user *optval,
                                  int __user *optlen);
#endif
       int          (*sendmsg)   (struct kiocb *iocb, struct socket *sock,
                                  struct msghdr *m, size_t total_len);
       int          (*recvmsg)   (struct kiocb *iocb, struct socket *sock,
                                  struct msghdr *m, size_t total_len,
                                  int flags);
       int          (*mmap)      (struct file *file, struct socket *sock,
                                  struct vm_area_struct *vma);
       ssize_t      (*sendpage)  (struct socket *sock, struct page *page,
                                  int offset, size_t size, int flags);
       ssize_t      (*splice_read)(struct socket *sock, loff_t *ppos,
                                  struct pipe_inode_info *pipe, size_t len,
                                  unsigned int flags);
};
 
Socket Layer Initialization :
 
AF_INET is registered during kernel initialization, and the internal hooks that connect the AF_INET family with the TCP/IP protocol suite are done during socket initialization.
 
 
static int __init sock_init(void)
{
       /*
       *      Initialize sock SLAB cache.
       */
 
       sk_init();
 
       /*
       *      Initialize skbuff SLAB cache
       */
       skb_init();
 
       /*
       *      Initialize the protocols module.
       */
/*
 * We build the pseudo-filesystem for sockets; the first step is to set up
 * the socket inode cache. Linux, like other Unix operating systems, uses
 * the inode as the basic unit for filesystem implementation.
 */
       init_inodecache();
       register_filesystem(&sock_fs_type );
       sock_mnt = kern_mount (&sock_fs_type );
 
       /* The real protocol initialization is performed in later initcalls.
       */
 
#ifdef CONFIG_NETFILTER
       netfilter_init();
#endif
 
       return 0;
}
 
Family Values and the Protocol Switch Table :
 
the socket layer is used to interface with multiple protocol families and multiple protocols within a protocol family. 
 
After incoming packets are processed by the protocol stack, they eventually are passed up to the socket layer to be handed off to an application layer program. The socket layer must determine which socket should receive the packet, even though there may be multiple sockets open over different protocols. This is called socket de-multiplexing, and the protocol switch table is the core mechanism.
 
Figure 5.1 illustrates the registration process. It shows how the inet_protosw structure is initialized with proto and proto_ops structures for TCP/IP, the AF_INET family
 
 
Each of the registered protocols is kept in a table called the protocol switch table. Each entry in the table is an instance of the inet_protosw structure. The registration function, inet_register_protosw, puts the protocol described by the argument p into the protocol switch table, and the unregistration function, inet_unregister_protosw, removes it.
 
Each protocol instance in the protocol switch table is an instance of the inet_protosw structure, defined in file include/net/protocol.h.
/* This is used to register socket interfaces for IP protocols.  */
struct inet_protosw {
       struct list_head list;
 
        /* These two fields form the lookup key.  */
       unsigned short     type;   /* This is the 2nd argument to socket(2). */
/*
This is the protocol number for the protocol that is being registered.
*/
       unsigned short     protocol; /* This is the L4 protocol number.  */
/*
The field prot points to the protocol block structure. This structure is used when a socket is
created. This structure is used to build an interface to any protocol that supports a socket
interface. The next field, ops, points to a protocol-specific set of operation functions for this
protocol.
*/
       struct proto      *prot;
       const struct proto_ops *ops ;
 
       char             no_check;   /* checksum on rcv/xmit/none? */
/*
If flags is set to INET_PROTOSW_PERMANENT, the protocol is permanent and can’t be
unregistered.
*/
       unsigned char      flags;      /* See INET_PROTOSW_* below.  */
};
 
 
The permanent protocols in IPv4 are registered by the function inet_init. (See the separate notes on the inet_init() function for details.)
 
static int __init inet_init(void)
. . .
/* The code actually registers the protocols after they have been placed into an array. */
    for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
        INIT_LIST_HEAD(r);
 
    for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
        inet_register_protosw(q);
. . .
}  
 
The protocols in the array are UDP, TCP, and raw. The values for each protocol are initialized into the inet_protosw structure at compile time, as shown here.
static struct inet_protosw inetsw_array[] =
{
        /* The first protocol is TCP, so type is SOCK_STREAM and flags
         * is set to permanent. */
        {
                type:           SOCK_STREAM,
                protocol:       IPPROTO_TCP,
                prot:           &tcp_prot,
                ops:            &inet_stream_ops,
                capability:     -1,
                no_check:       0,
                flags:          INET_PROTOSW_PERMANENT,
        },
        /* The second protocol is UDP, so type is SOCK_DGRAM and flags
         * is also set to permanent. */
        {
                type:           SOCK_DGRAM,
                protocol:       IPPROTO_UDP,
                prot:           &udp_prot,
                ops:            &inet_dgram_ops,
                capability:     -1,
                no_check:       UDP_CSUM_DEFAULT,
                flags:          INET_PROTOSW_PERMANENT,
        },
        /* The third protocol is "raw", so type is SOCK_RAW and flags is
         * set to reuse. Notice the protocol value is IPPROTO_IP, which is
         * zero and indicates the "wild card": a raw socket can actually be
         * used to set options in any protocol in the AF_INET family. This
         * corresponds to the fact that the protocol field is typically set
         * to zero for a raw socket. */
        {
                type:           SOCK_RAW,
                protocol:       IPPROTO_IP,     /* wild card */
                prot:           &raw_prot,
                ops:            &inet_dgram_ops,
                capability:     CAP_NET_RAW,
                no_check:       UDP_CSUM_DEFAULT,
                flags:          INET_PROTOSW_REUSE,
        }
};
 
The socket layer family registration facility provides two functions and one key data structure.

The first function, sock_register, registers a protocol family with the socket layer:

int sock_register(struct net_proto_family *fam);
 
static const struct net_proto_family * net_families[NPROTO] __read_mostly;
 
struct net_proto_family {
       int          family;
       int          (*create )(struct net *net , struct socket *sock ,
                         int protocol, int kern);
       struct module     * owner;
};
 

Linux TCP/IP Stack Study Notes (2): Main Frame RX/TX Functions and the net_device Structure

 

/**
 *    netif_rx    –     post buffer to the network code
 *    @skb: buffer to post
 *
 *    This function receives a packet from a device driver and queues it for
 *    the upper (protocol) levels to process.  It always succeeds. The buffer
 *    may be dropped during processing for congestion control or by the
 *    protocol layers.
 *
 *    return values:
 *    NET_RX_SUCCESS    (no congestion)
 *    NET_RX_DROP     (packet was dropped)
 *
 */
 
int netif_rx(struct sk_buff *skb)
{
       struct softnet_data *queue;
       unsigned long flags;
 
       /* if netpoll wants it, pretend we never saw it */
       if (netpoll_rx(skb))
             return NET_RX_DROP;
 
       if (!skb->tstamp.tv64)
             net_timestamp(skb);        /* record the time the frame was received */
 
       /*
        * The code is rearranged so that the path is the most
        * short when CPU is congested, but is still operating.
        */
       local_irq_save(flags);
       queue = &__get_cpu_var(softnet_data);   /* this CPU's softnet_data */
 
       __get_cpu_var(netdev_rx_stat).total++;  /* one more frame received on this CPU */
       /*
        * Check whether the queue still has room for the frame. If it is full,
        * congestion is severe: return an error, and the CPU will drop frames
        * that arrive from now on.
        */
       if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
             if (queue->input_pkt_queue.qlen) {
enqueue:
                   /* add the frame to the softnet_data input queue */
                   __skb_queue_tail(&queue->input_pkt_queue, skb);
                   local_irq_restore(flags);
                   return NET_RX_SUCCESS;
             }
             /*
              * When the queue is empty it has not yet been scheduled by the
              * softirq, so add it to the softirq poll list. What is added is
              * the backlog; since netif_rx is called by non-NAPI drivers, the
              * backlog's poll function is process_backlog, installed at
              * initialization.
              */
             napi_schedule(&queue->backlog);
             goto enqueue;
       }
 
       __get_cpu_var(netdev_rx_stat).dropped++;
       local_irq_restore(flags);
 
       kfree_skb(skb);
       return NET_RX_DROP;
}
The code above uses a key data structure, softnet_data. While the NIC sends and receives data, a buffer queue is needed to absorb bursts of traffic; in the protocol stack this buffer is represented by a queue layer that sits between the data link layer and the network layer. softnet_data is that link-layer structure; it is a per-CPU variable, one instance per CPU.
 
 
/**
 *    netif_receive_skb – process receive buffer from network
 *    @skb: buffer to process
 *
 *    netif_receive_skb() is the main receive data processing function.
 *    It always succeeds. The buffer may be dropped during processing
 *    for congestion control or by the protocol layers.
 *
 *    This function may only be called from softirq context and interrupts
 *    should be enabled.
 *
 *    Return values (usually ignored):
 *    NET_RX_SUCCESS: no congestion
 *    NET_RX_DROP: packet was dropped
 */
/*
 * netif_receive_skb is the NAPI counterpart of netif_rx; it hands a packet to
 * the kernel. When a NAPI-capable driver has exhausted its supply of received
 * packets, it should re-enable interrupts and call netif_rx_complete (now
 * __napi_complete()) to stop polling.
 */
int netif_receive_skb( struct sk_buff * skb)
{
       struct packet_type * ptype, *pt_prev ;
       struct net_device * orig_dev;
       struct net_device * master;
       struct net_device * null_or_orig;
       struct net_device * null_or_bond;
       int ret = NET_RX_DROP;
       __be16 type;
 
       if (!skb->tstamp .tv64 )
             net_timestamp(skb);
 
       if (vlan_tx_tag_present (skb ) && vlan_hwaccel_do_receive(skb))
             return NET_RX_SUCCESS;
 
       /* if we’ve gotten here through NAPI, check netpoll */
       if (netpoll_receive_skb (skb ))
             return NET_RX_DROP;
 
       if (!skb->skb_iif)
             skb->skb_iif = skb->dev->ifindex;  /* record the frame's input interface */
 
       null_or_orig = NULL;
       orig_dev = skb->dev;
       master = ACCESS_ONCE (orig_dev ->master);
       if (master) {
             if (skb_bond_should_drop (skb , master ))
                   null_or_orig = orig_dev ; /* deliver only exact match */
             else
                   skb->dev = master ;
      }
 
       __get_cpu_var(netdev_rx_stat ).total ++;
 
       skb_reset_network_header(skb);
       skb_reset_transport_header(skb);
       skb->mac_len = skb->network_header - skb->mac_header;
 
       pt_prev = NULL;
 
       rcu_read_lock();
 
#ifdef CONFIG_NET_CLS_ACT
       if (skb->tc_verd & TC_NCLS) {
             skb->tc_verd = CLR_TC_NCLS( skb->tc_verd );
             goto ncls;
      }
#endif
       /* Walk every packet_type->func() registered on ptype_all. Handlers are
        * registered per packet type with dev_add_pack(); which handler serves
        * which packet type is covered later. */
       /* static struct list_head ptype_all __read_mostly; */
       list_for_each_entry_rcu(ptype, &ptype_all, list) {
             if (ptype->dev == null_or_orig || ptype->dev == skb-> dev ||
               ptype->dev == orig_dev) {
                   if (pt_prev)
                          ret = deliver_skb(skb, pt_prev, orig_dev); /* invoke the matching handler */
                   pt_prev = ptype;
            }
      }
 
#ifdef CONFIG_NET_CLS_ACT
       skb = handle_ing (skb , &pt_prev , &ret , orig_dev );
       if (!skb)
             goto out;
ncls:
#endif
       /* If the kernel was built with bridging, the bridge module runs here. */
       skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
       if (!skb)
             goto out;
       /* Runs only if the kernel was built with the MAC_VLAN module. */
       skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
       if (!skb)
             goto out;
 
       /*
       * Make sure frames received on VLAN interfaces stacked on
       * bonding interfaces still make their way to any base bonding
       * device that may have registered for a specific ptype.  The
       * handler may have to adjust skb->dev and orig_dev.
       */
       null_or_bond = NULL;
       if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
         (vlan_dev_real_dev( skb->dev)->priv_flags & IFF_BONDING)) {
             null_or_bond = vlan_dev_real_dev (skb ->dev);
      }
       /* Finally, walk ptype_base[ntohs(type) & PTYPE_HASH_MASK], calling every
        * matching packet_type->func(). This is where the layer-2 protocol field
        * selects the layer-3 entry point; the important handlers are ip_rcv()
        * and arp_rcv(). */
       type = skb->protocol;
       list_for_each_entry_rcu(ptype,
                   &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
             if (ptype->type == type && (ptype ->dev == null_or_orig ||
                ptype->dev == skb-> dev || ptype->dev == orig_dev ||
                ptype->dev == null_or_bond)) {
                   if (pt_prev)
                         ret = deliver_skb (skb , pt_prev , orig_dev );
                   pt_prev = ptype;
            }
      }
 
       if (pt_prev) {
             ret = pt_prev ->func( skb, skb->dev, pt_prev , orig_dev );
      } else {
             kfree_skb(skb);
              /* Jamal, now you will not able to escape explaining
              * me how you were going to use this. :-)
              */
             ret = NET_RX_DROP ;
      }
 
out:
       rcu_read_unlock();
       return ret;
}
 
 
/**
 *    dev_queue_xmit – transmit a buffer
 *    @skb: buffer to transmit
 *
 *    Queue a buffer for transmission to a network device. The caller must
 *    have set the device and priority and built the buffer before calling
 *    this function. The function can be called from an interrupt.
 *
 *    A negative errno code is returned on a failure. A success does not
 *    guarantee the frame will be transmitted as it may be dropped due
 *    to congestion or traffic shaping.
 *
 * -----------------------------------------------------------------------------
 *      I notice this method can also return errors from the queue disciplines,
 *      including NET_XMIT_DROP, which is a positive value.  So, errors can also
 *      be positive.
 *
 *      Regardless of the return value, the skb is consumed, so it is currently
 *      difficult to retry a send to this method.  (You can bump the ref count
 *      before sending to hold a reference for retry if you are careful.)
 *
 *      When calling this method, interrupts MUST be enabled.  This is because
 *      the BH enable code must have IRQs enabled so that it will not deadlock.
 *          --BLG
 */
int dev_queue_xmit( struct sk_buff * skb)
{
       struct net_device * dev = skb->dev;
       struct netdev_queue * txq;
       struct Qdisc * q;
       int rc = -ENOMEM;
 
       /* GSO will handle the following emulations directly. */
        if (netif_needs_gso (dev , skb )) /* GSO skb that the device cannot segment in hardware */
             goto gso;
 
       /* Convert a paged skb to linear, if required */
       if (skb_needs_linearize (skb , dev ) && __skb_linearize(skb))
             goto out_kfree_skb;
 
       /* If packet is not checksummed and device does not support
       * checksumming for this protocol, complete checksumming here.
       */
       if (skb->ip_summed == CHECKSUM_PARTIAL) {
              skb_set_transport_header(skb, skb->csum_start -
                                    skb_headroom(skb));
             if (!dev_can_checksum (dev , skb ) && skb_checksum_help(skb))
                   goto out_kfree_skb;
      }
 
gso:
       /* Disable soft irqs for various locks below. Also
       * stops preemption for RCU.
       */
       rcu_read_lock_bh();
 
       txq = dev_pick_tx (dev , skb );
       q = rcu_dereference_bh(txq->qdisc );
 
#ifdef CONFIG_NET_CLS_ACT
       skb->tc_verd = SET_TC_AT( skb->tc_verd , AT_EGRESS );
#endif
       if (q->enqueue ) {
             rc = __dev_xmit_skb (skb , q , dev , txq );
             goto out;
      }
 
       /* The device has no queue. Common case for software devices:
          loopback, all the sorts of tunnels...
 
         Really, it is unlikely that netif_tx_lock protection is necessary
         here.  (f.e. loopback and IP tunnels are clean ignoring statistics
         counters.)
         However, it is possible, that they rely on protection
         made by us here.
 
         Check this and shot the lock. It is not prone from deadlocks.
         Either shot noqueue qdisc, it is even simpler 8)
       */
       if (dev->flags & IFF_UP) {
             int cpu = smp_processor_id(); /* ok because BHs are off */
 
             if (txq->xmit_lock_owner != cpu) {
 
                   HARD_TX_LOCK(dev, txq, cpu);
 
                   if (!netif_tx_queue_stopped (txq )) {
                         rc = dev_hard_start_xmit (skb , dev , txq );
                         if (dev_xmit_complete (rc )) {
                               HARD_TX_UNLOCK(dev, txq);
                               goto out;
                        }
                  }
                   HARD_TX_UNLOCK(dev, txq);
                   if (net_ratelimit ())
                          printk(KERN_CRIT "Virtual device %s asks to "
                                "queue packet!\n", dev->name);
            } else {
                   /* Recursion is detected! It is possible,
                   * unfortunately */
                   if (net_ratelimit ())
                          printk(KERN_CRIT "Dead loop on virtual device "
                                "%s, fix it urgently!\n", dev->name);
            }
      }
 
        rc = -ENETDOWN;
       rcu_read_unlock_bh();
 
out_kfree_skb:
       kfree_skb(skb);
       return rc;
out:
       rcu_read_unlock_bh();
       return rc;
}
 
Any discussion of the data link layer has to cover the struct net_device. After 2.6.29 the net_device structure was reorganized, and its operation function pointers were refactored into net_device_ops. A brief walkthrough follows:
struct net_device 
{
/*
This first field, name, is the beginning of the visible part of this structure. It contains the string
that is the name of the interface. By visible, we mean that this part of the data structure is generic
and doesn't contain any private areas specific to a particular type of device.
*/
       char               name[IFNAMSIZ ];
       /* device name hash chain */
       struct hlist_node name_hlist;
       /* snmp alias */
       char               *ifalias ;
 
       /*
       *    I/O specific fields
       *    FIXME: Merge these and struct ifmap into one
       */
       unsigned long            mem_end;     /* shared mem end */
       unsigned long            mem_start;   /* shared mem start     */
       unsigned long            base_addr;   /* device I/O address   */
       unsigned int             irq;         /* device IRQ number    */
 
      /*
       *    Some hardware also needs these fields, but they are not
       *    part of the usual set specified in Space.c.
       */
 
       unsigned char            if_port;    /* Selectable AUI, TP,..*/
       unsigned char            dma;        /* DMA channel          */
 
       unsigned long            state;
       struct list_head  dev_list;
       struct list_head  napi_list;
       struct list_head  unreg_list;
 
       /* Net device features */
       unsigned long            features;
#define NETIF_F_SG            1     /* Scatter/gather IO. */
#define NETIF_F_IP_CSUM       2     /* Can checksum TCP/UDP over IPv4. */
#define NETIF_F_NO_CSUM       4     /* Does not require checksum. F.e. loopack. */
#define NETIF_F_HW_CSUM       8     /* Can checksum all the packets. */
#define NETIF_F_IPV6_CSUM     16    /* Can checksum TCP/UDP over IPV6 */
#define NETIF_F_HIGHDMA       32    /* Can DMA to high memory. */
#define NETIF_F_FRAGLIST      64    /* Scatter/gather IO. */
#define NETIF_F_HW_VLAN_TX    128   /* Transmit VLAN hw acceleration */
#define NETIF_F_HW_VLAN_RX    256   /* Receive VLAN hw acceleration */
#define NETIF_F_HW_VLAN_FILTER      512   /* Receive filtering on VLAN */
#define NETIF_F_VLAN_CHALLENGED     1024  /* Device cannot handle VLAN packets */
#define NETIF_F_GSO           2048  /* Enable software GSO. */
#define NETIF_F_LLTX          4096  /* LockLess TX - deprecated. Please */
                              /* do not use LLTX in new drivers */
#define NETIF_F_NETNS_LOCAL   8192  /* Does not change network namespaces */
#define NETIF_F_GRO           16384 /* Generic receive offload */
#define NETIF_F_LRO           32768 /* large receive offload */
 
/* the GSO_MASK reserves bits 16 through 23 */
#define NETIF_F_FCOE_CRC      (1 << 24) /* FCoE CRC32 */
#define NETIF_F_SCTP_CSUM     (1 << 25) /* SCTP checksum offload */
#define NETIF_F_FCOE_MTU      (1 << 26) /* Supports max FCoE MTU, 2158 bytes*/
#define NETIF_F_NTUPLE        (1 << 27) /* N-tuple filters supported */
 
      /* Segmentation offload features */
#define NETIF_F_GSO_SHIFT     16
#define NETIF_F_GSO_MASK      0x00ff0000
#define NETIF_F_TSO           (SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
#define NETIF_F_UFO           (SKB_GSO_UDP << NETIF_F_GSO_SHIFT)
#define NETIF_F_GSO_ROBUST    (SKB_GSO_DODGY << NETIF_F_GSO_SHIFT)
#define NETIF_F_TSO_ECN       (SKB_GSO_TCP_ECN << NETIF_F_GSO_SHIFT)
#define NETIF_F_TSO6          (SKB_GSO_TCPV6 << NETIF_F_GSO_SHIFT)
#define NETIF_F_FSO           (SKB_GSO_FCOE << NETIF_F_GSO_SHIFT)
 
      /* List of features with software fallbacks. */
#define NETIF_F_GSO_SOFTWARE  (NETIF_F_TSO | NETIF_F_TSO_ECN | NETIF_F_TSO6)
 
 
#define NETIF_F_GEN_CSUM      (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
#define NETIF_F_V4_CSUM       (NETIF_F_GEN_CSUM | NETIF_F_IP_CSUM)
#define NETIF_F_V6_CSUM       (NETIF_F_GEN_CSUM | NETIF_F_IPV6_CSUM)
#define NETIF_F_ALL_CSUM      (NETIF_F_V4_CSUM | NETIF_F_V6_CSUM)
 
      /*
       * If one device supports one of these features, then enable them
       * for all in netdev_increment_features.
       */
#define NETIF_F_ONE_FOR_ALL   (NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ROBUST | \
                         NETIF_F_SG | NETIF_F_HIGHDMA |             \
                         NETIF_F_FRAGLIST)
 
       /* Interface index. Unique device identifier    */
       int                ifindex;
       int                iflink;
 
       struct net_device_stats stats;
 
#ifdef CONFIG_WIRELESS_EXT
      /* List of functions to handle Wireless Extensions (instead of ioctl).
       * See <net/iw_handler.h> for details. Jean II */
       const struct iw_handler_def  *wireless_handlers;
        /* Instance data managed by the core of Wireless Extensions. */
        struct iw_public_data  *wireless_data;
#endif
      /* Management operations */
       const struct net_device_ops *netdev_ops ;
       const struct ethtool_ops *ethtool_ops ;
 
       /* Hardware header description */
       const struct header_ops *header_ops ;
 
       unsigned int             flags;       /* interface flags (a la BSD)      */
       unsigned short           gflags;
         unsigned short          priv_flags; /* Like 'flags' but invisible to userspace. */
       unsigned short           padded;     /* How much padding added by alloc_netdev() */
 
       unsigned char            operstate; /* RFC2863 operstate */
       unsigned char            link_mode; /* mapping policy to operstate */
 
       unsigned           mtu;  /* interface MTU value        */
        unsigned short           type;       /* interface hardware type    */
       unsigned short           hard_header_len;  /* hardware hdr length      */
 
      /* extra head- and tailroom the hardware may need, but not in all cases
       * can this be guaranteed, especially tailroom. Some cases also use
       * LL_MAX_HEADER instead to allocate the skb.
       */
       unsigned short           needed_headroom;
       unsigned short           needed_tailroom;
 
       struct net_device * master; /* Pointer to master device of a group,
                                * which this device is member of.
                                */
 
      /* Interface address info. */
       unsigned char            perm_addr[MAX_ADDR_LEN ]; /* permanent hw address */
       unsigned char            addr_len;   /* hardware address length      */
       unsigned short          dev_id;           /* for shared network cards */
 
       struct netdev_hw_addr_list    uc;   /* Secondary unicast
                                       mac addresses */
       int                uc_promisc;
       spinlock_t        addr_list_lock ;
       struct dev_addr_list    *mc_list;   /* Multicast mac addresses      */
       int                mc_count;   /* Number of installed mcasts */
       unsigned int             promiscuity;
       unsigned int             allmulti;
 
 
       /* Protocol specific pointers */
      
 #ifdef CONFIG_NET_DSA
       void               *dsa_ptr ;   /* dsa specific data */
#endif
        void               *atalk_ptr;  /* AppleTalk link       */
       void               *ip_ptr ;    /* IPv4 specific data   */
       void                    *dn_ptr ;        /* DECnet specific data */
       void                    *ip6_ptr ;       /* IPv6 specific data */
       void               *ec_ptr ;    /* Econet specific data */
       void               *ax25_ptr ;  /* AX.25 specific data */
       struct wireless_dev     *ieee80211_ptr ;   /* IEEE 802.11 specific data,
                                       assign before registering */
 
/*
 * Cache line mostly used on receive path (including eth_type_trans())
 */
       unsigned long            last_rx;     /* Time of last Rx      */
      /* Interface address info used in eth_type_trans() */
       unsigned char            *dev_addr ;  /* hw address, (before bcast
                                       because most packets are
                                       unicast) */
 
       struct netdev_hw_addr_list    dev_addrs; /* list of device
                                          hw addresses */
 
       unsigned char            broadcast[MAX_ADDR_LEN ];       /* hw bcast add   */
 
       struct netdev_queue     rx_queue;
 
       struct netdev_queue     *_tx ____cacheline_aligned_in_smp ;
 
       /* Number of TX queues allocated at alloc_netdev_mq() time  */
       unsigned int             num_tx_queues;
 
       /* Number of TX queues currently active in device  */
       unsigned int             real_num_tx_queues;
 
       /* root qdisc from userspace point of view */
       struct Qdisc            *qdisc;
 
       unsigned long            tx_queue_len;      /* Max frames per queue allowed */
       spinlock_t        tx_global_lock ;
/*
 * One part is mostly used on xmit path (device)
 */
      /* These may be needed for future network-power-down code. */
 
      /*
       * trans_start here is expensive for high speed devices on SMP,
       * please use netdev_queue->trans_start instead.
       */
       unsigned long            trans_start;       /* Time (in jiffies) of last Tx     */
 
       int                watchdog_timeo; /* used by dev_watchdog() */
       struct timer_list watchdog_timer;
 
       /* Number of references to this device */
       atomic_t          refcnt ____cacheline_aligned_in_smp ;
 
       /* delayed register/unregister */
       struct list_head  todo_list;
       /* device index hash chain */
       struct hlist_node index_hlist;
 
       struct list_head  link_watch_list;
 
       /* register/unregister state machine */
       enum { NETREG_UNINITIALIZED =0,
            NETREG_REGISTERED,      /* completed register_netdevice */
            NETREG_UNREGISTERING,   /* called unregister_netdevice */
            NETREG_UNREGISTERED,    /* completed unregister todo */
            NETREG_RELEASED,        /* called free_netdev */
            NETREG_DUMMY,           /* dummy device for NAPI poll */
       } reg_state: 16;
 
       enum {
             RTNL_LINK_INITIALIZED,
             RTNL_LINK_INITIALIZING,
      } rtnl_link_state:16;
 
       /* Called from unregister, can be used to call free_netdev */
       void (*destructor )(struct net_device *dev );
 
#ifdef CONFIG_NETPOLL
       struct netpoll_info     *npinfo;
#endif
 
#ifdef CONFIG_NET_NS
      /* Network namespace this network device is inside */
       struct net        *nd_net;
#endif
 
      /* mid-layer private */
       void               *ml_priv ;
 
       /* bridge stuff */
       struct net_bridge_port  * br_port;
       /* macvlan */
       struct macvlan_port     *macvlan_port ;
       /* GARP */
       struct garp_port  * garp_port;
 
       /* class/net/name entry */
       struct device           dev;
       /* space for optional device, statistics, and wireless sysfs groups */
       const struct attribute_group *sysfs_groups [4];
 
       /* rtnetlink link ops */
       const struct rtnl_link_ops *rtnl_link_ops ;
 
       /* VLAN feature mask */
       unsigned long vlan_features ;
 
       /* for setting kernel sock attribute on TCP connection setup */
#define GSO_MAX_SIZE          65536
       unsigned int             gso_max_size;
 
#ifdef CONFIG_DCB
      /* Data Center Bridging netlink ops */
       const struct dcbnl_rtnl_ops *dcbnl_ops ;
#endif
 
#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
      /* max exchange id for FCoE LRO by ddp */
       unsigned int             fcoe_ddp_xid;
#endif
      /* n-tuple filter list attached to this device */
       struct ethtool_rx_ntuple_list ethtool_ntuple_list;
};
 
 

Avoiding and Identifying False Sharing Among Threads

Original article: http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads

Abstract

In symmetric multiprocessor (SMP) systems, each processor has a local cache. The memory system must guarantee cache coherence. False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces an update, which hurts performance. This article covers methods to detect and correct false sharing.

This article is part of the larger series, "Intel Guide for Developing Multithreaded Applications" (a series worth reading to sharpen your multithreaded programming), which provides guidelines for developing efficient multithreaded applications for Intel® platforms.

Background

False sharing is a well-known performance issue on SMP systems, where each processor has a local cache. It occurs when threads on different processors modify variables that reside on the same cache line, as illustrated in Figure 1. This circumstance is called false sharing because each thread is not actually sharing access to the same variable. Access to the same variable, or true sharing, would require programmatic synchronization constructs to ensure ordered data access.

The update of sum_local[me] inside the loop in the following example code causes false sharing:

double sum = 0.0, sum_local[NUM_THREADS];
#pragma omp parallel num_threads(NUM_THREADS)
{
    int me = omp_get_thread_num();
    sum_local[me] = 0.0;

    #pragma omp for
    for (int i = 0; i < N; i++)
        sum_local[me] += x[i] * y[i];  /* adjacent elements share a cache line */

    #pragma omp atomic
    sum += sum_local[me];
}

 There is a potential for false sharing on the array sum_local. This array is dimensioned according to the number of threads and is small enough to fit in a single cache line. When executed in parallel, the threads modify different, but adjacent, elements of sum_local, which invalidates the cache line for all processors.



Figure 1. False sharing occurs when threads on different processors modify variables that reside on the same cache line. This invalidates the cache line and forces a memory update to maintain cache coherency.

In Figure 1, threads 0 and 1 require variables that are adjacent in memory and reside on the same cache line. The cache line is loaded into the caches of CPU 0 and CPU 1 (gray arrows). Even though the threads modify different variables (red and blue arrows), the cache line is invalidated, forcing a memory update to maintain cache coherency.

To ensure data consistency across multiple caches, multiprocessor-capable Intel® processors follow the MESI (Modified/Exclusive/Shared/Invalid) protocol. On first load of a cache line, the processor will mark the cache line as ‘Exclusive’ access. As long as the cache line is marked exclusive, subsequent loads are free to use the existing data in cache. If the processor sees the same cache line loaded by another processor on the bus, it marks the cache line with ‘Shared’ access. If the processor stores a cache line marked as ‘S’, the cache line is marked as ‘Modified’ and all other processors are sent an ‘Invalid’ cache line message. If the processor sees the same cache line which is now marked ‘M’ being accessed by another processor, the processor stores the cache line back to memory and marks its cache line as ‘Shared’. The other processor that is accessing the same cache line incurs a cache miss.

The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded. False sharing increases this coordination and can significantly degrade application performance.

Since compilers are aware of false sharing, they do a good job of eliminating instances where it could occur. For example, when the above code is compiled with optimization options, the compiler eliminates false sharing using thread-private temporary variables. Run-time false sharing from the above code will only be an issue if the code is compiled with optimization disabled.

Advice

The primary means of avoiding false sharing is through code inspection. Instances where threads access global or dynamically allocated shared data structures are potential sources of false sharing. Note that false sharing can be obscured by the fact that threads may be accessing completely different global variables that happen to be relatively close together in memory. Thread-local storage or local variables can be ruled out as sources of false sharing.

The run-time detection method is to use the Intel® VTune™ Performance Analyzer or Intel® Performance Tuning Utility (Intel PTU, available at /en-us/articles/intel-performance-tuning-utility/). This method relies on event-based sampling that discovers places where cache-line sharing exposes performance-visible effects. However, such effects don't distinguish between true and false sharing.

For systems based on the Intel® Core™ 2 processor, configure the VTune analyzer or Intel PTU to sample the MEM_LOAD_RETIRED.L2_LINE_MISS and EXT_SNOOP.ALL_AGENTS.HITM events. For systems based on the Intel® Core™ i7 processor, configure them to sample MEM_UNCORE_RETIRED.OTHER_CORE_L2_HITM. If you see a high occurrence of EXT_SNOOP.ALL_AGENTS.HITM events (a fraction of a percent or more of INST_RETIRED.ANY events) in some code regions on Intel® Core™ 2 processor family CPUs, or a high occurrence of MEM_UNCORE_RETIRED.OTHER_CORE_L2_HITM events on Intel® Core™ i7 family CPUs, you have true or false sharing. Inspect the load/store instructions within threads where the MEM_LOAD_RETIRED.L2_LINE_MISS or MEM_UNCORE_RETIRED.OTHER_CORE_L2_HITM events concentrate to determine the likelihood that the memory locations reside on the same cache line and cause false sharing.

Intel PTU comes with predefined profile configurations to collect events that will help to locate false sharing. These configurations are “Intel® Core™ 2 processor family – Contested Usage” and “Intel® Core™ i7 processor family – False-True Sharing.” Intel PTU Data Access analysis identifies false sharing candidates by monitoring different offsets of the same cacheline accessed by different threads. When you open the profiling results in Data Access View, the Memory Hotspot pane will have hints about false sharing at the cacheline granularity, as illustrated in Figure 2.

 Figure 2. False sharing shown in Intel PTU Memory Hotspots pane.

In Figure 2, memory offsets 32 and 48 (of the cacheline at address 0x00498180) were accessed by the ID=59 thread and the ID=62 thread at the work function. There is also some true sharing due to array initialization done by the ID=59 thread.

The pink color is used to hint about false sharing at a cacheline. Note the high figures for MEM_UNCORE_RETIRED.OTHER_CORE_L2_HITM associated with the cacheline and its corresponding offsets.

Once detected, there are several techniques to correct false sharing. The goal is to ensure that variables causing false sharing are spaced far enough apart in memory that they cannot reside on the same cache line. While the following is not an exhaustive list, three possible methods are discussed below.

One technique is to use compiler directives to force individual variable alignment. The following source code demonstrates the compiler technique using __declspec(align(n)), where n equals 64 (a 64-byte boundary), to align the individual variables on cache line boundaries.

__declspec (align(64)) int thread1_global_variable;
__declspec (align(64)) int thread2_global_variable;

When using an array of data structures, pad the structure to the end of a cache line to ensure that the array elements begin on a cache line boundary. If you cannot ensure that the array is aligned on a cache line boundary, pad the data structure to twice the size of a cache line. The following source code demonstrates padding a data structure to a cache line boundary and ensuring the array is also aligned using the compiler __declspec (align(n)) statement where n equals 64 (64 byte boundary). If the array is dynamically allocated, you can increase the allocation size and adjust the pointer to align with a cache line boundary.

struct ThreadParams
{
    // For the following 4 variables: 4*4 = 16 bytes
    unsigned long thread_id;
    unsigned long v; // Frequent read/write access variable
    unsigned long start;
    unsigned long end;

    // expand to 64 bytes to avoid false sharing
    // (4 unsigned long variables + 12 padding ints) * 4 = 64
    int padding[12];
};

__declspec (align(64)) struct ThreadParams Array[10];

 In short, you still need __declspec(align(64)) to keep the variables a full 64 bytes apart; otherwise false sharing can still occur. One might think that since each padded structure is already exactly 64 bytes the array self-aligns and the directive is redundant, but __declspec(align(64)) is what forces each element to start exactly on a cache line boundary, so it is still required.

It is also possible to reduce the frequency of false sharing by using thread-local copies of data. The thread-local copy can be read and modified frequently and only when complete, copy the result back to the data structure. The following source code demonstrates using a local copy to avoid false sharing.

struct ThreadParams
{
    // For the following 4 variables: 4*4 = 16 bytes
    unsigned long thread_id;
    unsigned long v; // Frequent read/write access variable
    unsigned long start;
    unsigned long end;
};

void threadFunc(void *parameter)
{
    ThreadParams *p = (ThreadParams*) parameter;
    // local copy for the read/write access variable
    unsigned long local_v = p->v;

    for (local_v = p->start; local_v < p->end; local_v++)
    {
        // Functional computation
    }

    p->v = local_v; // Update the shared data structure only once
}

 

Usage Guidelines

Avoid false sharing but use these techniques sparingly. Overuse can hinder the effective use of the processor’s available cache. Even with multiprocessor shared-cache designs, avoiding false sharing is recommended. The small potential gain for trying to maximize cache utilization on multi-processor shared cache designs does not generally outweigh the software maintenance costs required to support multiple code paths for different cache architectures.

Additional Resources

Here is one more small example (from Wikipedia):

Example

struct foo {
    int x;
    int y;
};

static struct foo f;

/* The two following functions are running concurrently: */

int sum_a(void)
{
    int s = 0;
    int i;
    for (i = 0; i < 1000000; ++i)
        s += f.x;
    return s;
}

void inc_b(void)
{
    int i;
    for (i = 0; i < 1000000; ++i)
        ++f.y;
}
Here, sum_a may need to continually re-read x from main memory (instead of from cache) even though inc_b's modification of y should be irrelevant.

I remember once being asked about false sharing in an interview.

Linux TCP/IP Stack (1): Protocol Overview, Data Link Layer, and Drivers

Chap-3:  TCP/IP in Embedded Systems

 
Two guiding principles allow protocol stacks to be implemented as shown in the OSI model: information hiding and encapsulation.
 
The physical layer (PHY) is responsible for the modulation and electrical details of data transmission.
 
One of the responsibilities of the data link layer is to provide an error-free transmission channel known as Connection Oriented (CO) service. Another function of the link layer is to establish the type of framing to be used when the IP packet is transmitted.
 
Network Layer(IP Layer) contains the knowledge of network topology. It includes the routing protocols and understands the network addressing scheme. Although the main responsibility of this layer is routing of packets, it also provides fragmentation to break large packets into smaller pieces so they can be transmitted across an interface that has a small Maximum Transmission Unit (MTU). Another function of IP is the capability to multiplex incoming packets destined for each of the transport protocols.
 
Differentiating among the classes of addresses is a significant function of the IP layer. There are three fundamental types of IP addresses: unicast addresses for sending a packet to an individual destination, multicast addresses for sending data to multiple destinations, and broadcast addresses for sending packets to everyone within reach
 
The purpose of ARP is to determine what the physical destination address should be that corresponds to the destination IP address. 
 
The transport layer in TCP/IP consists of two major protocols. It contains a connection-oriented, reliable service, otherwise known as a streaming service, provided by the TCP protocol. In addition, TCP/IP includes an individual packet transmission service known as an unreliable or datagram service, which is provided by UDP.
 
TCP divides the data stream into segments. The sequence number and acknowledgment number fields are byte pointers that keep track of the position of the segments within the data stream.
 
 
A main advantage of sockets in the Unix or Linux environment is that the socket is treated as a file descriptor, and all the standard IO functions work on sockets in the same way they work on a local file. 
 
 
The session layer can be thought of as analogous to a signaling protocol where information is exchanged between end points about how to set up a session. 
One widely used session layer protocol is the Telnet protocol
 
Specific Requirements for Embedded OSs:
     Timer facility; Concurrency and multitasking; Buffer management; Link layer facility; Low latency; Minimal data copying 
 
 
Chap-4:  Linux Networking Interfaces and Device Drivers
 
 
The TCP/IP stack provides a registration mechanism between the device drivers and the layer above, and this registration mechanism allows the output routines in the networking layer, such as IP, to call the driver’s transmit function for a specific interface port without needing to know the driver’s internal details.
 
The net_device structure, defined in file linux/include/netdevice.h, is the data structure that defines an instance of a network interface. It tracks the state information of all the network interface devices attached to the TCP/IP stack.
 
Network Device Initialization:
 
When the network interface driver's initialization or probe function is called, the first thing it does is allocate the driver's private data structure. Next, it must set a few key fields in the structure, calling dev_alloc_name to set up the name string, and then directly initializing the other device-specific fields in the net_device structure.
 
alloc_etherdev calls alloc_netdev and passes it a pointer to a setup function as the second argument.
 
The initialization function in each driver must allocate the net_device structure, which is used to connect the network interface driver with the network layer protocols
 
Once the net_device structure is initialized, we can register it. 
 
if ((rc = register_netdev (dev )) != 0) { 
        goto err_dealloc;
    }  
 
 
 
struct pci_driver {
       struct list_head node;
       char *name;
       const struct pci_device_id *id_table;   /* Pointer to the PCI configuration space information; must be non-NULL for probe to be called */
       int  (*probe)(struct pci_dev *dev, const struct pci_device_id *id);  /* New device inserted */
       void (*remove)(struct pci_dev *dev);    /* Device removed (NULL if not a hot-plug capable driver) */
       int  (*suspend)(struct pci_dev *dev, pm_message_t state);  /* Device suspended */
       int  (*suspend_late)(struct pci_dev *dev, pm_message_t state);
       int  (*resume_early)(struct pci_dev *dev);
       int  (*resume)(struct pci_dev *dev);                       /* Device woken up */
       void (*shutdown)(struct pci_dev *dev);
       struct pci_error_handlers *err_handler;
       struct device_driver    driver;
       struct pci_dynids dynids;
};
 
After all the initialization of the net device structure and the associated private data structure is complete, the driver can be registered as a networking device.
 
Network device registration consists of putting the driver’s net_device structure on a linked list.
Most of the functions involved in network device registration use the name field in the net_device structure. This is why driver writers should use the dev_alloc_name function to ensure that the name field is formatted properly.
 
The list of net devices is protected by the routing netlink mutex locking and unlocking functions, rtnl_lock and rtnl_unlock. The list of devices should not be manipulated without locking; if the locks are not used, the device list can become corrupted, or two devices that try to register in parallel can be assigned the same name.
int register_netdev( struct net_device * dev)
{
       int err;
 
       rtnl_lock();
 
       /*
       * If the name is a format string the caller wants us to do a
       * name allocation.
       */
       if (strchr(dev->name, '%')) {
             err = dev_alloc_name(dev, dev->name);
             if (err < 0)
                   goto out;
      }

       err = register_netdevice(dev);
out:
       rtnl_unlock();
       return err;
}
 
Network Device Registration Utility Functions :
 
The first function dev_get_by_name finds a device by name. It can be called from any context because it does its own locking. It returns a pointer to a net_device based on the string name.
 
struct net_device * dev_get_by_name(const char *name); 
 
We send a notification message to any interested protocols that this device is about to be destroyed by calling notifier_call_chain:
 
notifier_call_chain(&netdev_chain, NETDEV_UNREGISTER, dev);
 
int         register_netdev(struct net_device *dev);
int         register_netdevice(struct net_device *dev);
void        unregister_netdev(struct net_device *dev);
struct net_device *alloc_etherdev(int sizeof_priv);
 
 
Network Interface Driver Service Functions :
 
The driver’s open function is called through the open field in the net_device structure:
int (*open)(struct net_device *dev);
Open is called by the generic dev_open function in linux/net/core/dev.c:
int dev_open(struct net_device *dev);
 
First, dev_open checks to see if the device has already been activated by checking for IFF_UP in the flags field of the net_device structure; if the driver is already up, it simply returns zero. Next, it checks to see if the physical device is present by calling netif_device_present, which checks the link state bits in the state field of the network device structure. If all this succeeds, dev_open calls the driver through the open field in the net_device structure.
 
Most drivers use the open function to initialize their internal data structures prior to accepting and transmitting packets. These structures may include the internal queues, watchdog timers, and lists of internal buffers. Next, the driver generally starts up the receive queue by calling netif_start_queue, defined in linux/include/linux/netdevice.h, which starts the queue by clearing the __LINK_STATE_XOFF in the state field of the net_device structure. The states are listed in Table 4.1 and are used by the queuing layer to control the transmit queues for the device. See Section 4.8 for a description of how the packet queuing layer works. Right up to the point where the queuing is started, the driver can change the device’s queuing discipline. Chapter 6 has more detail about Linux’s capability to work with multiple queuing disciplines for packet transmission queues.
 
If everything is OK, the flags field is set to IFF_UP and __LINK_STATE_START is set in the state field to indicate that the network link is active and ready to receive packets. Next, dev_open calls dev_mc_upload to set up the list of multicast addresses for this device. Finally, dev_open calls dev_activate, which sets up a default queuing discipline for the device, typically pfifo_fast for hardware devices and none for pseudo or software devices.
 
 
The set_multicast_list driver service function initializes the list of multicast addresses for the interface. 
void    (*set_multicast_list)(struct net_device *dev);
 
Device multicast addresses are contained in a generic structure, dev_mc_list, which allows the interface to support one or more link layer multicast addresses.  
 
struct dev_mc_list {
    struct dev_mc_list *next;
    __u8                dmi_addr[MAX_ADDR_LEN];
    unsigned char       dmi_addrlen;
    int                 dmi_users;
    int                 dmi_gusers;
};
 
 
The hard_start_xmit network interface service function starts the transmission of an individual packet or queue of packets
 
int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
 
This function is called from the network queuing layer when a packet is ready for transmission. The first thing the driver must do in hard_start_xmit is ensure that there are hardware resources available for transmitting the packet and there is a sufficient number of available buffers.
 
 
The change_mtu network interface service function changes the Maximum Transmission Unit (MTU) of a device:
int (*change_mtu)(struct net_device *dev, int new_mtu);
 
get_stats returns a pointer to the network device statistics:
struct net_device_stats *(*get_stats)(struct net_device *dev);
 
 
do_ioctl implements any device-specific socket I/O control (ioctl) functions:
int (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
 
 
Receiving Packets :
 
As is the case with any hardware device driver in other operating systems, the first step in packet reception occurs when the device responds to an interrupt from the network interface hardware.
If we detect a receive interrupt, we know that a received packet is available
for processing so we can begin to perform the steps necessary for packet reception. One of the first things we must do is gather the buffer containing the raw received packet into a socket buffer or sk_buff. Most efficient drivers will avoid copying the data at this step. In Linux, the socket buffers are used to contain network data packets, and they can be set up to point directly to the DMA space.
 
Generally, network interface drivers maintain a list of sk_buffs in their private data structure, and once the interrupt indicates that input DMA is complete, we can place the socket buffer containing the new packet on a queue of packets ready for processing by the protocol’s input function. 
 
 
The netif_rx function declared in file linux/include/linux/netdevice.h is called by the ISR to invoke the input side of packet processing and queue up the packet for processing by the packet receive softirq, NET_RX_SOFTIRQ. 
The netif_rx function returns a value indicating the amount of network congestion detected by the queuing layer or whether the packet was dropped altogether. Table 4.6 shows the return values for netif_rx. 
 
netif_rx is the main function called from interrupt service routines in network interface drivers. It is defined in file linux/net/core/dev.c.
 
 
Starting with Linux version 2.4, this structure includes a copy of a pseudo net device structure called blog_dev, otherwise known as the backlog device.
 
/*
 * Incoming packets are placed on per-cpu queues so that
 * no locking is needed.
 */
struct softnet_data {
       struct Qdisc            *output_queue;
       struct sk_buff_head     input_pkt_queue;
       struct list_head        poll_list;
       struct sk_buff          *completion_queue;

       struct napi_struct      backlog;
};
 
struct napi_struct {
       /* The poll_list must only be managed by the entity which
        * changes the state of the NAPI_STATE_SCHED bit.  This means
        * whoever atomically sets that bit can add this napi_struct
        * to the per-cpu poll_list, and whoever clears that bit
        * can remove from the list right before clearing the bit.
        */
       struct list_head        poll_list;

       unsigned long           state;
       int                     weight;
       int                     (*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
       spinlock_t              poll_lock;
       int                     poll_owner;
#endif

       unsigned int            gro_count;

       struct net_device       *dev;
       struct list_head        dev_list;
       struct sk_buff          *gro_list;
       struct sk_buff          *skb;
};
The old softnet_data has been split into these two structures, but the functionality implemented is essentially the same.
 
blog_dev is used by the packet queuing layer to store the packet queues for most non-polling network interface drivers. The blog_dev device is used instead of the “real” net_device structure to hold the queues, but the actual device is still used to keep track of the network interface on which the packet arrived as the packet is processed by the upper-layer protocols.
 
Transmitting Packets :
 
Packet transmission is controlled by the upper layers, not by the network interface driver. 
 
The hard_start_xmit function is actually called from the packet queuing layer when there are one or more packets in a socket buffer ready to transmit. In most drivers, when it is called from the queuing layer, hard_start_xmit puts the sk_buff on a local queue in the driver’s private data structure and enables the transmit-available interrupt.
 
Linux provides a mechanism for device status change notification called notifier chains. 
Each location in the linked list is defined by an instance of the notifier_block structure. 
 
 
/*
 * Notifier chains are of four types:
 *
 *    Atomic notifier chains: Chain callbacks run in interrupt/atomic
 *          context. Callouts are not allowed to block.
 *    Blocking notifier chains: Chain callbacks run in process context.
 *          Callouts are allowed to block.
 *    Raw notifier chains: There are no restrictions on callbacks,
 *          registration, or unregistration.  All locking and protection
 *          must be provided by the caller.
 *    SRCU notifier chains: A variant of blocking notifier chains, with
 *          the same restrictions.
 *
 * atomic_notifier_chain_register() may be called from an atomic context,
 * but blocking_notifier_chain_register() and srcu_notifier_chain_register()
 * must be called from a process context.  Ditto for the corresponding
 * _unregister() routines.
 *
 * atomic_notifier_chain_unregister(), blocking_notifier_chain_unregister(),
 * and srcu_notifier_chain_unregister() _must not_ be called from within
 * the call chain.
 *
 * SRCU notifier chains are an alternative form of blocking notifier chains.
 * They use SRCU (Sleepable Read-Copy Update) instead of rw-semaphores for
 * protection of the chain links.  This means there is _very_ low overhead
 * in srcu_notifier_call_chain(): no cache bounces and no memory barriers.
 * As compensation, srcu_notifier_chain_unregister() is rather expensive.
 * SRCU notifier chains should be used when the chain will be called very
 * often but notifier_blocks will seldom be removed.  Also, SRCU notifier
 * chains are slightly more difficult to use because they require special
 * runtime initialization.
 */
 
struct notifier_block {
       int (*notifier_call)(struct notifier_block *, unsigned long, void *);
       struct notifier_block *next;
       int priority;
};
 
The first of these, notifier_chain_register, registers a notifier_block with the event notification facility.
 
int notifier_chain_register(struct notifier_block **list, struct notifier_block *n);
 
To pass an event into the notification call chain, the function notifier_call_chain is called with a pointer to the notifier_block list, n, an event value, val, and an optional generic argument, v:
 
int notifier_call_chain(struct notifier_block **n, unsigned long val, void *v);
 
 
int register_netdevice_notifier( struct notifier_block * nb); 
 
A notification function will be called through the notifier_call field in nb for all the event types in Table 4.7.
int (*notifier_call)(struct notifier_block *self, unsigned long, void *);
 
 

Note: this series of study notes is based mainly on The Linux TCP/IP Stack: Networking for Embedded Systems, whose line of presentation I find easy to follow. The book targets early 2.6 kernels, so while studying I read it alongside the 2.6.34 kernel source. Along the way I found that the kernel protocol stack has changed quite a bit: the main line is largely unchanged, but the kernel developers have refactored the code and improved its efficiency. Following the book's outline, these notes mainly reference the 2.6.34 kernel source. This approach may have its flaws, and I welcome corrections and guidance.