Wednesday, January 17, 2018

Confusing networking terminology

As can be inferred from various posts on this blog, I'm not too clever with networking... yet.

I'm discovering that part of my confusion comes from imprecise terminology. I blame this on my pedantic, mathematical mind.


First: there's no such thing as a TCP packet.

Okay, that's probably too pedantic. But as Wikipedia points out:
The term TCP packet appears in both informal and formal usage, whereas in more precise terminology segment refers to the TCP protocol data unit (PDU), datagram to the IP PDU, and frame to the data link layer PDU.
StackOverflow:

  • Segments are units of data in the Transport Layer (TCP/UDP in case of the Internet)
  • Packets or Datagrams are units of data in the Network Layer (IP in case of the Internet)
  • Frames are units of data in the Link Layer (e.g. Wifi, Bluetooth, Ethernet, etc).
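To make the nesting concrete, here is a minimal sketch of how the three units wrap each other for one full-sized Ethernet frame. The sizes are assumptions (IPv4 and TCP headers without options, no VLAN tag, a 1500-byte MTU), and the names are mine, not from any library.

```c
/* Assumed sizes: minimum headers, no options, no VLAN tag. */
enum {
    ETH_HEADER_LEN  = 14,   /* Ethernet II header: wraps a frame (link layer) */
    IPV4_HEADER_LEN = 20,   /* IPv4 header, no options: wraps a packet/datagram */
    TCP_HEADER_LEN  = 20,   /* TCP header, no options: wraps a segment */
    ETH_MTU         = 1500  /* largest IP packet a standard Ethernet frame carries */
};

/* Largest TCP segment (header + data) that fits inside one IP packet. */
int max_segment_len(int mtu) {
    return mtu - IPV4_HEADER_LEN;
}

/* Application bytes that fit inside one such segment. */
int max_payload_len(int mtu) {
    return max_segment_len(mtu) - TCP_HEADER_LEN;
}

/* Frame length for a packet of the given size, not counting the trailing FCS. */
int frame_len(int packet_len) {
    return ETH_HEADER_LEN + packet_len;
}
```

Under these assumptions, max_payload_len(ETH_MTU) is 1460 and frame_len(ETH_MTU) is 1514: the segment sits inside the packet, which sits inside the frame.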


Next: what is a socket?

Say you have a web server listening on port 80. When it gets a new incoming connection, does it have to create a new socket? Or is there just one socket (represented by localhost:80)?

From RFC 793:

    To allow for many processes within a single Host to use TCP
    communication facilities simultaneously, the TCP provides a set of
    addresses or ports within each host.  Concatenated with the network
    and host addresses from the internet communication layer, this forms
    a socket.  A pair of sockets uniquely identifies each connection.
    That is, a socket may be simultaneously used in multiple
    connections.

    The binding of ports to processes is handled independently by each
    Host.  However, it proves useful to attach frequently used processes
    (e.g., a "logger" or timesharing service) to fixed sockets which are
    made known to the public.
    
    ...
    To provide for
    unique addresses within each TCP, we concatenate an internet address
    identifying the TCP with a port identifier to create a socket which
    will be unique throughout all networks connected together.

It sure sounds like a socket is an IP address + TCP port, and your web server is merely seeing a new connection on the same socket. Similarly, here's Stevens in his authoritative book:
A TCP connection is defined to be a 4-tuple consisting of two IP addresses and two port numbers. More precisely, it is a pair of endpoints or sockets where each endpoint is identified by an (IP address, port number) pair. 
TCP/IP Illustrated, Volume 1, W. Richard Stevens
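The RFC/Stevens model can be written down as a small data structure. This is only a sketch of the definition, not how any real TCP stack stores its state; the struct names and the example addresses are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* A "socket" in the RFC/Stevens sense: an (IP address, port) endpoint. */
struct endpoint {
    char     ip[16];   /* dotted-quad IPv4 text, e.g. "192.0.2.10" */
    uint16_t port;
};

/* A connection is a pair of endpoints, i.e. the 4-tuple. */
struct connection {
    struct endpoint local;
    struct endpoint remote;
};

bool endpoint_equal(const struct endpoint *a, const struct endpoint *b) {
    return a->port == b->port && strcmp(a->ip, b->ip) == 0;
}

bool connection_equal(const struct connection *a, const struct connection *b) {
    return endpoint_equal(&a->local, &b->local) &&
           endpoint_equal(&a->remote, &b->remote);
}

/* Two clients talking to the same server endpoint (made-up addresses). */
const struct connection conn_a = {{"192.0.2.10", 80}, {"198.51.100.7", 50001}};
const struct connection conn_b = {{"192.0.2.10", 80}, {"198.51.100.8", 50001}};
```

Here conn_a and conn_b share the same local endpoint (192.0.2.10:80) but are still different connections, because the remote endpoints differ. In this model, "a socket may be simultaneously used in multiple connections" is not a contradiction.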

On the other hand, when you go to write your web server, you discover you need the accept function / system call:

https://www.gnu.org/software/libc/manual/html_node/Accepting-Connections.html
A socket that has been established as a server can accept connection requests from multiple clients. The server’s original socket does not become part of the connection; instead, accept makes a new socket which participates in the connection. accept returns the descriptor for this socket. The server’s original socket remains available for listening for further connection requests.
http://man7.org/linux/man-pages/man2/accept.2.html
The accept() system call is used with connection-based socket types
(SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection
request on the queue of pending connections for the listening socket,
sockfd, creates a new connected socket, and returns a new file
descriptor referring to that socket. The newly created socket is not
in the listening state. The original socket sockfd is unaffected by
this call.
Etc.

So your server must create a new socket, but the RFC says that "we concatenate an internet address ... with a port identifier to create a socket which will be unique ..." If your new socket has the same IP address and port as the web server (localhost:80), that would violate uniqueness. So clearly it must... get a new port number? Does the client then still think it's connected to port 80, and if so, who's doing the translating between 80 and your new port?
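One way to test the new-port guess is to ask the kernel directly. The sketch below is a loopback experiment, assuming a POSIX system with IPv4 loopback available: it listens on 127.0.0.1 with a kernel-chosen port (not port 80, which usually needs privileges), connects to itself, accepts, and reports the local port of both the listening and the accepted socket. The function names are mine.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Local port number of a socket, or -1 on error. */
int local_port(int fd) {
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) return -1;
    return ntohs(addr.sin_port);
}

/* Returns 0 on success and fills in the outputs, -1 on any failure. */
int accept_port_demo(int *listen_port, int *accepted_port, int *distinct_fds) {
    int listener = -1, client = -1, conn = -1, rc = -1;
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel pick a free port */

    listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) goto done;
    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    if (listen(listener, 1) < 0) goto done;
    /* Learn which port the kernel chose so the client can connect to it. */
    if (getsockname(listener, (struct sockaddr *)&addr, &len) < 0) goto done;

    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0) goto done;
    /* The kernel completes the handshake into the listen backlog,
       so this returns before accept() is called. */
    if (connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;

    conn = accept(listener, NULL, NULL);
    if (conn < 0) goto done;

    *listen_port   = local_port(listener);
    *accepted_port = local_port(conn);
    *distinct_fds  = (conn != listener);
    rc = (*listen_port < 0 || *accepted_port < 0) ? -1 : 0;

done:
    if (conn >= 0) close(conn);
    if (client >= 0) close(client);
    if (listener >= 0) close(listener);
    return rc;
}
```

If it behaves the way I'd expect, the two ports come back equal while the descriptors differ. That would mean no new port number and no translation: the accepted socket shares the listener's local address and is told apart only by the remote end of the connection.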

It turns out that "socket" is overloaded:
The distinctions between a socket (internal representation), socket descriptor (abstract identifier), and socket address (public address) are subtle, and these are not carefully distinguished in everyday usage. Further, specific definitions of a "socket" differ between authors and often refers specifically to an internet socket or TCP socket. 
(Wikipedia) 
So maybe a good mental model is that there is just one socket, but each new connection gets a new file descriptor for it:

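     /* sockfd is the listening socket; cli_addr receives the client's address */
     socklen_t clilen = sizeof(cli_addr);  /* must be set before the call */
     int newsockfd = accept(sockfd,
                 (struct sockaddr *) &cli_addr,
                 &clilen);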

Unfortunately that doesn't quite work. Per the man page, accept "creates a new connected socket, and returns a new file descriptor referring to that socket." The socket isn't just the file descriptor, and it's not the thing from the RFC. It's something else.
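A small experiment that separates the descriptor from the socket: dup() hands back a second descriptor for the same underlying object. This is a sketch assuming a POSIX system with IPv4 loopback; the helper names are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Port the socket is bound to, or -1 on error. */
int bound_port(int fd) {
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) return -1;
    return ntohs(addr.sin_port);
}

/* Returns 1 if two distinct descriptors report the same bound port,
   0 if they do not, -1 on error. */
int dup_shares_socket(void) {
    struct sockaddr_in addr;
    int result = -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int fd2 = dup(fd);  /* second descriptor, same underlying socket */
    if (fd2 < 0) {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* kernel-chosen port */

    /* Bind through the first descriptor only... */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
        int p1 = bound_port(fd);
        int p2 = bound_port(fd2);  /* ...and observe it through the second. */
        if (p1 > 0 && p2 > 0) result = (fd != fd2 && p1 == p2);
    }

    close(fd2);
    close(fd);
    return result;
}
```

Two different integers, one binding. So the descriptor is just a handle, and the "socket" that accept creates is the kernel object behind it, which is neither the integer nor the RFC's address-plus-port.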

This kind of thing confounds me endlessly. I have two conflicting mental models of how things work, without realizing it. While working on any one component of a system, I'm using one of the models, and then at some unrecognized later point I've switched. And when I'm forced to reconcile them (since both are from authoritative sources) I generate an explanation that's just plain wrong.

Software engineers are not mathematicians. Apparently, neither are physicists. Oh well.
