Here is an example diagram of how one might use a relatively low-powered microcontroller with an external crypto chip like the ATECC608B.
I haven't specified the particular hardware like Raspberry Pi, Arduino, or ESP because my primary goal is to grasp the general process first.
Sure, that makes sense. Note that my diagram was just an example intended to help with the discussion, rather than a firm suggestion. (I could probably draw half a dozen alternate ones depending on the hardware used.)
On the client side, there's a home automation system, which includes a microcontroller connected to a Bluetooth module, Wi-Fi module, Ethernet module, and a camera. The lights are controlled through the GPIO pins of the microcontroller via a relay driver board, and the home automation system communicates with the Wi-Fi router at my home. I believe this hardware setup is sufficient for understanding the process between server and client.
The camera puts additional constraints on your design choices, though, because of the large amount of data it produces.
As an example, consider Omnivision OV5640 five-megapixel camera modules. There are variants suitable for some microcontrollers (that can provide say a still JPEG image to your microcontroller), but more commonly these modules use two-lane CSI, for use with various single-board computers. (The camera module itself is supported by the Linux kernel, so it's a matter of whether the SBC hardware supports two-lane CSI or not.)
(Many 32-bit microcontrollers do support, for example, PSRAM for extending their address space, so that one could handle an uncompressed 5MP image in RAM. For example, my Teensy 4.1 supports two cheap PSRAM chips, giving me an extra 16M (16,777,216 bytes) of directly addressable RAM. The issue is that with e.g. OV5640, the alternative to CSI is a 10-bit parallel bus, which on Teensy 4.1 requires the use of FlexIO, limiting the pins one can use. Simply put, it is NOT just a matter of "I shall use this MCU and this camera module", even if they seem to have sufficient pins and memory and capabilities.)
However, for the purposes of understanding the communications between the various pieces, let's assume the camera produces still images or a compressed video stream and you have enough RAM on the microcontroller to handle that.
I acknowledge that TCP and WebSocket might not be the optimal choices for this requirement, but I've chosen them due to my personal interest in exploring these options. Following your suggestion, I plan to add ARP and DHCP to the list.
Sure, I make such choices all the time with my own projects (developed for my own needs, as opposed to for others): it is a good thing, and keeps your motivation high, too.
I'll add a quick recap of ARP, ICMP, and DHCP after the horizontal line below.
Now, I've set up what I would call a 'Local Network,' where my router, mobile phone, and the home automation system are interconnected, allowing them to exchange data within this local network.
I'm currently trying to understand the protocols that establish the connection between the server and my home automation system. It seems that TCP/IP establishes this connection, with WebSocket likely used to transmit data, including commands to turn on lights and record video. I am not sure, so could someone confirm this?
Yes, that's basically how it happens.
TCP over IP is your transport protocol for Ethernet and WiFi. (The difference is the underlying transport below IP: for wired Ethernet it is IEEE 802.3 (also called 'Ethernet'), and for WiFi it is one of the IEEE 802.11 protocols. The IEEE 802.11 protocols are designed to work seamlessly with IEEE 802.3, so we application/device/appliance developers don't actually need to care.)
Some Wiznet modules and most WiFi modules you can use with microcontrollers implement a full IP stack: they handle the low-level protocol details (ARP and ICMP, whether over Ethernet or WiFi), support both TCP and UDP, and even include a simple DHCP client. Thus, you normally only need to interface to their IP stack via the received and sent datagrams (UDP) or data streams (TCP) and possibly state changes (obtained IP address, lost network connectivity, failed to authenticate WiFi connection, et cetera), and don't need to worry about what the underlying transport is.
Bluetooth uses its own set of protocols, with standard types for some use cases (like USB has USB Serial, USB Audio, and USB Video, that do not require device-specific drivers at all). I haven't used Bluetooth much myself.
In any case, WebSocket, HTTP, and MQTT are protocols: ways to format the data so that the other end correctly interprets it. They are almost always used for formatting data sent and received over TCP/IP, but are not intrinsically tied to TCP, just to the reliability and ordering guarantees TCP gives.
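To make "a way to format the data" concrete, here is a minimal sketch of how a client could wrap a short text command into a single WebSocket frame, following the frame layout in RFC 6455. The JSON command and the fixed mask key are made up for illustration; a real client would use a fresh random mask key for every frame, and would first complete the HTTP upgrade handshake, which is not shown.

```python
import os
import struct
from typing import Optional

def websocket_text_frame(message: str, mask_key: Optional[bytes] = None) -> bytes:
    """Encode one unfragmented, masked WebSocket text frame (client to server)."""
    payload = message.encode("utf-8")
    if mask_key is None:
        mask_key = os.urandom(4)          # clients mask every frame with a random key
    first = 0x81                          # FIN bit set, opcode 1 (text)
    if len(payload) < 126:
        header = struct.pack("!BB", first, 0x80 | len(payload))
    elif len(payload) < 65536:
        header = struct.pack("!BBH", first, 0x80 | 126, len(payload))
    else:
        header = struct.pack("!BBQ", first, 0x80 | 127, len(payload))
    # Masking is a byte-wise XOR with the 4-byte key, repeated as needed.
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return header + mask_key + masked

# Hypothetical lighting command; the message format is entirely up to you.
command = '{"light": "bedroom", "state": "on"}'
frame = websocket_text_frame(command, mask_key=b"\x01\x02\x03\x04")
print(frame.hex())
```

The resulting bytes are what would be written to the TCP stream; the receiving end reverses the masking to recover the original text.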
You do not need to use the same TCP/IP connection for everything. TCP/IP connections are identified by four things: source IP address, source port, target IP address, and target port. A port is a number between 1 and 65535, inclusive, typically used to identify the service. IANA maintains a registry of assigned port numbers, but basically you can freely let your users choose the port for the microcontroller. Typically the source port number is assigned by the operating system or IP stack, so when the server initiates the connection, do not require any specific source port; use only the target port on the microcontroller to identify the service or connection type desired.
When the microcontroller initiates the connection, only the server address and target port really matter.
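As a small illustration of the four things identifying a connection, the following sketch opens a TCP connection over the loopback interface and prints the resulting tuple. It assumes an ordinary desktop Python with the standard socket module, not any particular microcontroller IP stack; the listener stands in for the service, and the operating system picks both port numbers here.

```python
import socket

# Listening side: port 0 asks the operating system for any free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
target_ip, target_port = listener.getsockname()

# Connecting side: only the target address and port are specified;
# the source port is assigned automatically.
client = socket.create_connection((target_ip, target_port))
accepted, peer = listener.accept()

source_ip, source_port = client.getsockname()
connection = (source_ip, source_port, target_ip, target_port)
print("connection identified by:", connection)

accepted.close()
client.close()
listener.close()
```

Note that the listening side sees the automatically assigned source port as the peer address, which is why it should not depend on any particular value for it.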
WebSockets and HTTP typically use port 80 for unencrypted connections, and 443 for TLS-encrypted connections. The default MQTT port is 1883 for unencrypted connections, and 8883 for TLS-encrypted connections. But these are just common conventions, and you can choose for yourself. You could use say port 15190 for the camera, port 7045 for lighting control, and so on, regardless of which protocol each uses; it is up to you.
A serious design question is whether the server makes the connection to the microcontroller, or the microcontroller to the server. Both have their benefits and downsides, some of which I already outlined above; I prefer the server-initiated approach for the reasons I gave. The client initiating the connection may be more common and preferred by others: I can imagine several cases where that would be better, just not this particular one. In the client-initiated case, the ports used on the microcontroller do not matter, of course, but then you need to configure where the microcontroller should connect; and if you want to use hostnames instead of IP addresses, you'll need to add DNS protocol (client) support to the list.
In addition, I'm confused about the communication between the home automation system and the router, both of which support Wi-Fi. I'm uncertain about the specific protocols involved and whether they utilize TCP/IP for this purpose.
While the underlying WiFi transport is one of the IEEE 802.11 protocols (depending on the WiFi type), they are all compatible with Ethernet, so that from a programmer's perspective it is no different to wired Ethernet connections at all.
The one exception is when the microcontroller or device uses a WiFi module directly: then, the WiFi stack needs to know the name of the access point (the WiFi router to be used) and a password/passphrase or equivalent (depending on the WiFi security model used), and your code needs to note when a WiFi connection to said access point has been established. (Depending on the WiFi implementation, this can involve some commands to be sent to the WiFi stack at specific stages, or it can do all of it automatically.) Other than that, communication over WiFi, including using DHCP to obtain an IP address, is done just like when using wired Ethernet.
When using Ethernet, WiFi, or Bluetooth, we deal with data in packets. The hardware handles the on-wire format (possibly with some help from the IP stack for details).
For Ethernet (IEEE 802.3), the packet on the wire starts with 8 fixed bytes: 7 bytes of preamble, and a start frame delimiter byte. This is followed by the Ethernet header, consisting of the destination MAC address (6 bytes), source MAC address (6 bytes), optionally a 4-byte VLAN tag (IEEE 802.1Q, identifying the virtual LAN the packet is part of), and a 2-byte type or length field (big-endian byte order; the type value is 0x0800 for IPv4 and 0x86DD for IPv6). Then come 46 to 1500 bytes of payload forming an IP packet (padded if shorter), and a 4-byte Ethernet checksum (the CRC-32 frame check sequence).
Most IP stacks strip out the Ethernet frame on reception, giving you just the IP packet payload; and when sending, they take just the IP packet and construct the Ethernet frame around it.
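If you ever do get raw frames from the hardware, taking the header apart is straightforward. The following is a minimal sketch, assuming the hardware has already removed the preamble and the trailing checksum (which is typical); the sample frame and its MAC addresses are made up. The 2-byte field after the addresses is treated as a type field here, with 0x8100 marking an IEEE 802.1Q VLAN tag and 0x0800 marking an IPv4 payload.

```python
import struct

def format_mac(raw: bytes) -> str:
    """Format 6 raw bytes as hh:hh:hh:hh:hh:hh."""
    return ":".join(f"{b:02x}" for b in raw)

def parse_ethernet_header(frame: bytes) -> dict:
    """Split a frame into addresses, optional VLAN id, type field, and payload."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    offset = 14
    vlan_id = None
    if ethertype == 0x8100:               # IEEE 802.1Q tag present
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        vlan_id = tci & 0x0FFF            # low 12 bits identify the virtual LAN
        offset = 18
    return {
        "dst": format_mac(dst),
        "src": format_mac(src),
        "vlan": vlan_id,
        "ethertype": ethertype,
        "payload": frame[offset:],
    }

# Made-up broadcast frame with a 46-byte all-zero placeholder payload.
sample = bytes.fromhex("ffffffffffff" "020000000001" "0800") + bytes(46)
header = parse_ethernet_header(sample)
print(header["dst"], header["src"], hex(header["ethertype"]), len(header["payload"]))
```

An IP stack does essentially this on reception, then hands only the payload onward.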
For WiFi, the packet is similar, and for communications across the access point, also contains destination and source MAC addresses. Again, most WiFi stacks strip out and construct this frame themselves, so typically we work with only the contained IP packets. (There are additional packet types for negotiating the connection, of course.)
ARP is the protocol that is used to find out the MAC addresses (hh:hh:hh:hh:hh:hh) of machines in the same local network; the IP address alone does not suffice. When the target IP address is not in the local network, the MAC address of the gateway (switch or router) is used instead. Thus, each device has a limited-size ARP cache, which maps IP addresses to 6-byte MAC addresses, and vice versa. This should be internal to the IP stack you use, and not something you normally need to handle yourself.
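Just to illustrate what a limited-size ARP cache amounts to, here is a rough sketch. The class name, the default size, and the least-recently-used eviction policy are my assumptions for the example; real IP stacks typically also expire entries after some time, which is not shown.

```python
from collections import OrderedDict

class ArpCache:
    """Hypothetical fixed-size IP-to-MAC table, evicting the least recently used entry."""

    def __init__(self, max_entries=8):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def learn(self, ip, mac):
        """Record a mapping seen in an ARP reply (or any received frame)."""
        self._entries[ip] = mac
        self._entries.move_to_end(ip)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # drop the oldest entry

    def lookup(self, ip):
        """Return the MAC address, or None if an ARP request must be sent first."""
        mac = self._entries.get(ip)
        if mac is not None:
            self._entries.move_to_end(ip)       # mark as recently used
        return mac

cache = ArpCache(max_entries=2)
cache.learn("192.168.1.1", "02:00:00:00:00:01")
cache.learn("192.168.1.20", "02:00:00:00:00:14")
cache.learn("192.168.1.30", "02:00:00:00:00:1e")   # evicts 192.168.1.1
print(cache.lookup("192.168.1.1"), cache.lookup("192.168.1.30"))
```

A lookup that returns nothing is the point where the stack would broadcast an ARP request and wait for the reply.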
The next step is obtaining an IP address, netmask (identifying the bits in IP addresses that must be the same for the address to be within the local network), and gateway IP address (router connected to internet). These can be configured statically, obtained via DHCP protocol using UDP packets, or link-local address autoconfiguration may be used. (The last one is simply picking a random address in the 169.254.0.0/16 IPv4 block or fe80::/10 IPv6 block, with no gateway so only local network comms are possible; and using ARP (or neighbor discovery with IPv6) to verify nobody has picked that IP address yet, and retrying until an unused IP address within that block is obtained.)
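The role of the netmask and gateway can be sketched in a few lines using Python's standard ipaddress module. The function names and all addresses below are made up for illustration. The second function picks a random IPv4 link-local candidate; it skips the first and last 256 addresses of the block, which RFC 3927 reserves, and does not do the ARP probing step.

```python
import ipaddress
import random

def next_hop(target, interface, gateway):
    """Return the IP address whose MAC address the packet must be sent to.

    interface is the device's own address with prefix length, e.g. "192.168.1.50/24".
    """
    network = ipaddress.ip_interface(interface).network
    target_ip = ipaddress.ip_address(target)
    if target_ip in network:
        return target_ip                    # same local network: deliver directly
    return ipaddress.ip_address(gateway)    # elsewhere: hand it to the gateway

def random_link_local(rng=random):
    """Pick a candidate in 169.254.1.0 .. 169.254.254.255 (still needs an ARP probe)."""
    return ipaddress.ip_address(f"169.254.{rng.randint(1, 254)}.{rng.randint(0, 255)}")

print(next_hop("192.168.1.20", "192.168.1.50/24", "192.168.1.1"))   # local target
print(next_hop("203.0.113.5", "192.168.1.50/24", "192.168.1.1"))    # via gateway
print(random_link_local())
```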
DHCP protocol consists of UDP packets (so no connection per se). There are essentially four types of packets used. First, the device sends a discover packet using source IPv4 address 0.0.0.0 port 68 to target broadcast address 255.255.255.255 port 67. The server responds to the same MAC address with an offer, targeted to the offered IPv4 address port 68. The client is then supposed to do an ARP request to find out if any device on the local network already uses the offered address, and not accept the offer if one does, but this normally only occurs when there is more than one DHCP server on the same local network. When a suitable offer has been received, or the client remembers it has a DHCP lease that should still be valid, the client then sends a request packet, again using IPv4 source 0.0.0.0:68 to 255.255.255.255:67. If the DHCP server grants this request, it responds with an acknowledgement packet, which identifies the IPv4 address to be used, netmask, gateway, lease time in seconds (how long this address grant is valid), and addresses of DNS servers the device can query for mapping host names into IP addresses. (The netmask tells the device which IP addresses are directly reachable on the local network, and which need to be directed to the gateway.)
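To show what the first of those packets looks like, here is a minimal sketch that builds the UDP payload of a DHCP discover message, following the fixed BOOTP-style layout and magic cookie from RFC 2131. The MAC address and transaction id are made up, only the mandatory message type option is included, and actually sending it (UDP from 0.0.0.0:68 to 255.255.255.255:67) is left out because that usually requires elevated privileges.

```python
import struct

DHCP_MAGIC_COOKIE = bytes([99, 130, 83, 99])

def build_dhcp_discover(mac: bytes, xid: int) -> bytes:
    """Build a bare-bones DHCP discover payload for an Ethernet client."""
    fixed = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: request from client
        1,            # htype: Ethernet
        6,            # hlen: MAC address length
        0,            # hops
        xid,          # transaction id chosen by the client
        0,            # secs
        0,            # flags
        bytes(4),     # ciaddr: client has no address yet
        bytes(4),     # yiaddr: filled in by the server in its offer
        bytes(4),     # siaddr
        bytes(4),     # giaddr
        mac,          # chaddr: struct pads the 6 bytes to 16 with zeros
        b"",          # sname: unused, zero-filled
        b"",          # file: unused, zero-filled
    )
    options = bytes([53, 1, 1,    # option 53 (message type), length 1, value 1 = discover
                     255])        # end of options
    return fixed + DHCP_MAGIC_COOKIE + options

# Made-up locally administered MAC address and transaction id.
discover = build_dhcp_discover(bytes.fromhex("020000000001"), xid=0x12345678)
print(len(discover), discover[:12].hex())
```

The offer, request, and acknowledgement use the same layout, with a different message type value and more options.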
DHCP over IPv6 is similar, but has some different options, and of course the addresses are 128 bits long (instead of 32 as in IPv4).
When host names and not IP addresses are used, the mapping is done by a DNS server, often with just simple UDP queries (DNS over UDP), although TCP, QUIC, TLS-encrypted connections (DNS-over-TLS), and even DNS-over-HTTPS can be used with some servers. Again, this is only needed if host names instead of IP addresses are used. Also, if a query produces more than one matching result (for example, a server may have more than one valid address), you are supposed to connect to them in a round-robin manner, and not just always hammer the first one even if it does not answer.
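A simple DNS-over-UDP query is small enough to build by hand. The sketch below constructs a query for an IPv4 address (an A record) and shows one possible way to rotate through several returned addresses; the function names, host name, query id, and addresses are made up, and sending the query to port 53 and parsing the response are not shown.

```python
import struct

def build_dns_query(hostname: str, query_id: int) -> bytes:
    """Build a DNS query asking for the A (IPv4 address) record of hostname."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is sent as length-prefixed labels, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)   # type A, class IN
    return header + question

def round_robin(addresses, last_used_index):
    """Return the addresses in the order to try them, starting after the last used."""
    start = (last_used_index + 1) % len(addresses)
    return addresses[start:] + addresses[:start]

query = build_dns_query("temperature.local", query_id=0x1234)
print(query.hex())
print(round_robin(["192.0.2.1", "192.0.2.2", "192.0.2.3"], last_used_index=0))
```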
If you have a DNS server or cache (like dnsmasq, dnscache, or even bind) for your local network under your own control, you can use the top-level domain .local for your local network. Such name queries are reserved for the local network. Thus, using names like master.bedroom.local or temperature.local or whatever.you.want.local is perfectly allowed within your local network; and, if your computers/tablets/etc. are configured to use that DNS server while connected to the local network (requiring only the DHCP server configuration to point to this DNS server), you can use those names in your browser, too, even when external internet is also available. (Be aware that .local is also the domain reserved for multicast DNS, so devices with an mDNS resolver may not send such queries to your DNS server at all; home.arpa is the domain reserved specifically for home networks, and can be used the same way.)
ICMP is a protocol used at the IP network level. When you send a message outside the local network via the gateway, you may receive an ICMP message telling you the recipient is unavailable, for example. The IP stack should implement this transparently for you in most cases. The sender is not notified about dropped packets, but the unavailable notification (and other error conditions) is useful for early detection that a TCP connection cannot be made. Otherwise, it would take the TCP response timeout to fire before it would be noticed, and that timeout can be quite long (minutes instead of seconds, typically, to deal with temporary connection hiccups).
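For the curious, this is roughly what the IP stack does with such a notification. The sketch below verifies the standard Internet checksum of an ICMP message and recognizes a destination unreachable message (type 3) with its most common codes; the function names and the sample message are made up, with zero bytes standing in for the copy of the original IP header that a real message carries.

```python
import struct

UNREACHABLE_CODES = {
    0: "network unreachable",
    1: "host unreachable",
    2: "protocol unreachable",
    3: "port unreachable",
}

def internet_checksum(data: bytes) -> int:
    """One's complement sum of 16-bit big-endian words, as used by IP and ICMP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back in
    return ~total & 0xFFFF

def describe_icmp(message: bytes) -> str:
    """Return a short description of an ICMP message."""
    icmp_type, code = struct.unpack("!BB", message[:2])
    if internet_checksum(message) != 0:             # a valid message sums to zero
        return "corrupt ICMP message"
    if icmp_type == 3:
        return "destination unreachable: " + UNREACHABLE_CODES.get(code, f"code {code}")
    return f"ICMP type {icmp_type}, code {code}"

# Made-up "host unreachable" message with a placeholder body.
body = struct.pack("!BBHI", 3, 1, 0, 0) + bytes(28)
message = body[:2] + struct.pack("!H", internet_checksum(body)) + body[4:]
print(describe_icmp(message))
```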