Software engineering for dummies

As can be inferred from various posts on this blog, I'm not too clever with networking... yet.

I'm discovering that part of my confusion comes from imprecise terminology. I blame this on my pedantic, mathematical mind.

First: there's no such thing as a TCP packet.

Okay, that's probably too pedantic. But as Wikipedia points out:

The term TCP packet appears in both informal and formal usage, whereas in more precise terminology segment refers to the TCP protocol data unit (PDU), datagram to the IP PDU, and frame to the data link layer PDU.

StackOverflow:

Segments are units of data in the Transport Layer (TCP/UDP in case of the Internet)
Packets or Datagrams are units of data in the Network Layer (IP in case of the Internet)
Frames are units of data in the Link Layer (e.g. Wifi, Bluetooth, Ethernet, etc).
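To make the nesting concrete for myself, here is a rough sketch in Python. The class names are my own invention, and the header sizes are the minimum ones (no TCP or IP options, no VLAN tag, Ethernet checksum not counted), so treat the numbers as illustrative rather than what you'd see on every wire.

```python
from dataclasses import dataclass

# Assumed minimum header sizes, in bytes.
TCP_HEADER = 20       # TCP header without options
IPV4_HEADER = 20      # IPv4 header without options
ETHERNET_HEADER = 14  # dst MAC + src MAC + EtherType (trailing checksum not counted)


@dataclass
class Segment:
    """Transport-layer unit: TCP header + application bytes."""
    payload: bytes

    def size(self) -> int:
        return TCP_HEADER + len(self.payload)


@dataclass
class Packet:
    """Network-layer unit: IP header + one segment."""
    segment: Segment

    def size(self) -> int:
        return IPV4_HEADER + self.segment.size()


@dataclass
class Frame:
    """Link-layer unit: Ethernet header + one packet."""
    packet: Packet

    def size(self) -> int:
        return ETHERNET_HEADER + self.packet.size()


frame = Frame(Packet(Segment(b"x" * 100)))
print(frame.packet.segment.size())  # 120
print(frame.packet.size())          # 140
print(frame.size())                 # 154
```

Each layer just wraps the one above it and adds its own header, which is why the same 100 bytes of data can be called a segment, a packet, or a frame depending on where you're looking.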

Next: what is a socket?

Say you have a web server listening on port 80. When it gets a new incoming connection, does it have to create a new socket? Or is there just one socket (represented by localhost:80)?

From the RFC:

    To allow for many processes within a single Host to use TCP
    communication facilities simultaneously, the TCP provides a set of
    addresses or ports within each host.  Concatenated with the network
    and host addresses from the internet communication layer, this forms
    a socket.  A pair of sockets uniquely identifies each connection.
    That is, a socket may be simultaneously used in multiple
    connections.

    The binding of ports to processes is handled independently by each
    Host.  However, it proves useful to attach frequently used processes
    (e.g., a "logger" or timesharing service) to fixed sockets which are
    made known to the public.

...

    To provide for
    unique addresses within each TCP, we concatenate an internet address
    identifying the TCP with a port identifier to create a socket which
    will be unique throughout all networks connected together.

It sure sounds like a socket is an IP address + TCP port, and your web server is merely seeing a new connection on the same socket. Similarly, here's Stevens in his authoritative book:

A TCP connection is defined to be a 4-tuple consisting of two IP addresses and two port numbers. More precisely, it is a pair of endpoints or sockets where each endpoint is identified by an (IP address, port number) pair.

TCP-IP Illustrated Volume 1, W. Richard Stevens
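Stevens's definition is easy to write down as data. This is only a sketch of the idea, with made-up type names and addresses from the ranges reserved for documentation:

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    """A 'socket' in the RFC/Stevens sense: an (IP address, port) pair."""
    ip: str
    port: int


class Connection(NamedTuple):
    """A TCP connection: a pair of endpoints, i.e. a 4-tuple."""
    local: Endpoint
    remote: Endpoint


# Hypothetical addresses, chosen only for illustration.
server = Endpoint("192.0.2.10", 80)

conn_a = Connection(server, Endpoint("198.51.100.7", 50000))
conn_b = Connection(server, Endpoint("198.51.100.7", 50001))  # same client, other port
conn_c = Connection(server, Endpoint("203.0.113.5", 50000))   # other client, same port

connections = {conn_a, conn_b, conn_c}
print(len(connections))                               # 3
print(all(c.local == server for c in connections))    # True
```

All three connections share the same server endpoint, and they are still distinct because the pair differs. In this model the server endpoint really is "simultaneously used in multiple connections," as the RFC puts it.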

On the other hand, when you go to write your web server, you discover you need the accept function / system call:

https://www.gnu.org/software/libc/manual/html_node/Accepting-Connections.html

A socket that has been established as a server can accept connection requests from multiple clients. The server’s original socket does not become part of the connection; instead, accept makes a new socket which participates in the connection. accept returns the descriptor for this socket. The server’s original socket remains available for listening for further connection requests.

http://man7.org/linux/man-pages/man2/accept.2.html

The accept() system call is used with connection-based socket types
(SOCK_STREAM, SOCK_SEQPACKET). It extracts the first connection
request on the queue of pending connections for the listening socket,
sockfd, creates a new connected socket, and returns a new file
descriptor referring to that socket. The newly created socket is not
in the listening state. The original socket sockfd is unaffected by
this call.

Etc.

So, your server must create a new socket, but the RFC says that "we concatenate an internet address with a port identifier to create a socket which will be unique." If your new socket has the same IP address and port as the web server (localhost:80), that would violate uniqueness. So clearly it must... get a new port number? Does the client then still think it's connected to port 80, and if so, who's doing the translating between 80 and your new port?

It turns out that "socket" is overloaded:

The distinctions between a socket (internal representation), socket descriptor (abstract identifier), and socket address (public address) are subtle, and these are not carefully distinguished in everyday usage. Further, specific definitions of a "socket" differ between authors and often refers specifically to an internet socket or TCP socket.

(Wikipedia)

So maybe a good mental model is that there is just one socket, but each new connection gets a new file descriptor for it:

     newsockfd = accept(sockfd,
                        (struct sockaddr *) &cli_addr,
                        &clilen);

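Unfortunately that doesn't quite work. Per the man page, accept "creates a new connected socket, and returns a new file descriptor referring to that socket." The socket isn't just the file descriptor, and it's not the thing from the RFC. It's something else: an object inside the operating system that represents one end of one connection. The accepted socket keeps the same local IP address and port as the listening socket (so no new port number, and nobody translating), and what sets it apart is the client's address and port. The uniqueness the RFC promises belongs to the connection, the pair of addresses, and that still holds.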
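You can check what accept hands back without writing a whole web server. The following is a small experiment using Python's standard socket module on the loopback interface. The function name and the dictionary keys are my own, and port 0 just asks the OS for any free port.

```python
import socket


def demo_accept():
    """Accept two loopback connections and report what each socket looks like."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    listener.listen(2)
    server_addr = listener.getsockname()

    # Two clients connect to the same (IP, port).
    clients = [socket.create_connection(server_addr) for _ in range(2)]
    # accept() returns (new_socket, peer_address); keep only the socket.
    accepted = [listener.accept()[0] for _ in range(2)]

    result = {
        "listener_addr": server_addr,
        "accepted_local": [s.getsockname() for s in accepted],
        "accepted_peer": [s.getpeername() for s in accepted],
        "fds": [listener.fileno()] + [s.fileno() for s in accepted],
    }
    for s in clients + accepted + [listener]:
        s.close()
    return result


if __name__ == "__main__":
    for key, value in demo_accept().items():
        print(key, value)
```

When I think this through, the local address of both accepted sockets should equal the listener's address, while the peer addresses and file descriptors all differ. So there are three socket objects sharing one local (IP, port), and the client's side of the pair is what tells the connections apart.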

This kind of thing confounds me endlessly. I have two conflicting mental models of how things work, without realizing it. While working on any one component of a system, I'm using one of the models, and then at some unrecognized later point I've switched. And when I'm forced to reconcile them (since both are from authoritative sources) I generate an explanation that's just plain wrong.

Software engineers are not mathematicians. Apparently, neither are physicists. Oh well.

Somehow, every time I need to set up an SSH tunnel -- or even figure out whether it's applicable to my problem -- I have to re-learn the concept. And every time, I have to read multiple tutorials, none of which seem to explain very clearly what a tunnel is or what it does.

The author of this blog seems to feel the same way, and he does a great job explaining it. I'd like to summarize it here, through examples.

The first thing to understand is that an SSH tunnel is a way to open a TCP connection that you normally couldn't (because the server or port is blocked by a firewall, for example). Remarkably, almost none of the guides online explain this clearly.

Next: it requires an SSH daemon to be running on a remote machine that you can connect to (which we'll call remote). The local machine will be called local.

There are two modes of use: local (-L) and reverse (-R). Let's look at some examples.

Local mode

This is for when your local machine cannot access something that the remote machine can. So we create a tunnel where a local port can be used to access that something.

ssh -L 9001:yahoo.com:80 remote

This instructs remote to connect to yahoo.com:80, and makes that connection available via localhost:9001 on the local machine.

ssh -L 2000:localhost:5900 remote

This one is a bit confusing. Here, remote connects to its own localhost:5900, and makes that connection available via our localhost:2000. Remember that the host:port part of the argument (localhost:5900 here) is relative to remote in this mode.

Port 5900 is used for VNC (a way to share desktops). Normally, instead of using port 2000, we'd use 5900. But having two 5900s in the command makes it more confusing, so I've avoided this.

Reverse mode

This is for when your local machine can access something that the remote cannot. This flips around the meaning of the parameters.

For example, if my local machine is on my work intranet, and it can reach the remote (at home, say) over SSH, I might do this:

ssh -R 9001:intra-site.com:80 remote

Now intra-site.com:80 is relative to the local machine, and 9001 is the port on remote that people connect to in order to reach it.
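Since the flip between the two modes is the part I always forget, here is a tiny Python helper that spells out who listens and who connects. It is only a memory aid: the names are mine, and it handles just the plain port:host:hostport form used in this post, not the optional bind address or IPv6 brackets that the real ssh accepts.

```python
from typing import NamedTuple


class Forward(NamedTuple):
    listen_side: str    # machine that opens the listening port
    listen_port: int
    connect_side: str   # machine that opens the onward connection
    target_host: str    # resolved from the connect side's point of view
    target_port: int


def describe_forward(flag: str, spec: str) -> Forward:
    """Interpret a simple 'port:host:hostport' spec for -L or -R."""
    port, host, hostport = spec.split(":")
    if flag == "-L":
        return Forward("local", int(port), "remote", host, int(hostport))
    if flag == "-R":
        return Forward("remote", int(port), "local", host, int(hostport))
    raise ValueError(f"expected -L or -R, got {flag!r}")


print(describe_forward("-L", "9001:yahoo.com:80"))
print(describe_forward("-R", "9001:intra-site.com:80"))
```

The first number is always the listening port; the flag only decides which machine listens and which machine makes the onward connection.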

Under the hood

What's going on under the hood in the local and reverse cases? Here's my understanding. In both modes, there's a persistent connection between the local machine and remote:22 (where its sshd is running). In the examples below, the remote machine is called home.

Local

ssh -L 2000:yahoo.com:80 home

This creates a local process listening on port 2000. When something connects to port 2000, it instructs home to open a connection to yahoo.com:80.

Reverse

ssh -R 9001:intra-site.com:80 home

Now home has a process listening on 9001, and whenever something connects there, the local machine opens up a fresh connection to intra-site.com:80.

In either case, there is one connection that remains open (the "tunnel"), and one that opens in response to incoming requests.
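The "listen, then open a connection on demand" half of this can be sketched in plain Python. To be clear, this is not how ssh is implemented: there is no SSH connection, no encryption, and the listener and the outgoing connection live on the same machine. The function names are my own. It only shows the shape of a process that listens on a port and relays each incoming connection to a target.

```python
import socket
import threading


def pipe(src, dst):
    """Copy bytes from src to dst until src reaches end-of-stream."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass  # one side went away; stop relaying
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)  # signal "no more data" downstream
        except OSError:
            pass


def handle(client, target):
    """Open a fresh connection to target and relay in both directions."""
    with client, socket.create_connection(target) as upstream:
        t = threading.Thread(target=pipe, args=(client, upstream), daemon=True)
        t.start()
        pipe(upstream, client)
        t.join()


def start_forwarder(target, listen_host="127.0.0.1", listen_port=0):
    """Listen locally; for each incoming connection, connect to target."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((listen_host, listen_port))
    listener.listen()

    def accept_loop():
        while True:
            try:
                client, _ = listener.accept()
            except OSError:
                return  # accept failed, e.g. the listener was closed
            threading.Thread(target=handle, args=(client, target),
                             daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener
```

Calling start_forwarder(("yahoo.com", 80), listen_port=2000) would give you something that looks a bit like the -L example from the outside, except that the onward connection starts from your own machine instead of from home, which is exactly the part the tunnel exists to change.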

Wednesday, January 17, 2018

Confusing networking terminology

SSH tunneling for dummies