ESP32 in Production: OTA Updates, Fleet Management, and the Pitfalls Nobody Warns You About

You flash your first ESP32, it blinks an LED, and you think you understand the platform. Then you deploy forty of them across a building, push a firmware update at 2am, and watch half the fleet go dark. That’s when you actually start learning ESP32.

The hardware is dirt cheap and absurdly capable — dual-core Xtensa LX6, 520KB SRAM, Wi-Fi, Bluetooth, and a price tag under $5. But the gap between “prototype that works on your desk” and “production fleet you can update without touching a device” is enormous, and almost nobody documents it honestly.

This post is about closing that gap.

The Real Problem: OTA Is Not a Feature, It’s a System

ESP32’s OTA (Over-the-Air) update capability sounds simple: upload new firmware, device reboots, done. The reality is that OTA is a state machine with failure modes that will brick devices if you don’t design for them explicitly.

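ESP-IDF’s built-in “Factory app, two OTA definitions” partition table gives you two OTA slots (ota_0 and ota_1) plus a small otadata partition that tracks which slot is active. The out-of-the-box default is a single factory app with no OTA slots, so you have to opt in. When an update succeeds, otadata is pointed at the new slot and the bootloader boots it on the next reset. When it fails, whether from a power cut mid-flash, a Wi-Fi dropout, or a firmware bug that panics on boot, you need the device to recover automatically.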

The mechanism exists. Most tutorials don’t explain when to actually call esp_ota_mark_app_valid_cancel_rollback(), and that omission is responsible for countless bricked production devices.

// Wrong: mark valid immediately after boot
void app_main(void) {
    esp_ota_mark_app_valid_cancel_rollback(); // Too early!
    // ... rest of init
}

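// Right: mark valid only after you've proven the firmware works
void app_main(void) {
    nvs_flash_init();
    wifi_init();

    // Only a freshly updated image is on trial. Without this check, a broker
    // outage would roll back firmware that was validated long ago.
    esp_ota_img_states_t state;
    bool on_trial =
        esp_ota_get_state_partition(esp_ota_get_running_partition(), &state) == ESP_OK &&
        state == ESP_OTA_IMG_PENDING_VERIFY;

    // Connect to MQTT broker; if this fails during the trial boot, we want rollback
    if (mqtt_connect() != ESP_OK) {
        if (on_trial) {
            ESP_LOGE(TAG, "Cannot reach broker, triggering rollback");
            esp_ota_mark_app_invalid_rollback_and_reboot();  // does not return on success
        }
        // Already-validated image: retry with backoff instead of rolling back
    } else if (on_trial) {
        // Only commit once we've proven connectivity
        esp_ota_mark_app_valid_cancel_rollback();
        ESP_LOGI(TAG, "Firmware validated, rollback cancelled");
    }
}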

CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y enables the rollback mechanism. This is a boolean, not a counter. If the new firmware doesn’t call esp_ota_mark_app_valid_cancel_rollback() before rebooting, the bootloader reverts to the previous slot on the next boot — no retry window, no second chances. The device boots the new firmware exactly once; it must prove itself or the bootloader silently reverts.
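
To make the boot-once rule concrete, here is a small host-side model of the per-slot state the bootloader tracks. It is a sketch, not ESP-IDF code: the enum and function names are hypothetical stand-ins (the real states are the esp_ota_img_states_t values), and it models only the transitions described above.

```c
#include <stdbool.h>

/* Hypothetical model of the state kept per slot in otadata.
 * Names are illustrative, not ESP-IDF identifiers. */
typedef enum {
    IMG_NEW,            /* just written by OTA, never booted */
    IMG_PENDING_VERIFY, /* booted once, app has not confirmed yet */
    IMG_VALID,          /* app confirmed it works */
    IMG_ABORTED         /* never confirmed; not selected again */
} img_state_t;

/* What the bootloader does to the selected slot at each boot.
 * Returns true if the slot is booted, false if it rolls back. */
bool bootloader_select(img_state_t *state) {
    switch (*state) {
    case IMG_NEW:
        *state = IMG_PENDING_VERIFY; /* the one and only trial boot */
        return true;
    case IMG_PENDING_VERIFY:
        *state = IMG_ABORTED;        /* rebooted without confirming */
        return false;
    case IMG_VALID:
        return true;
    default:
        return false;
    }
}

/* What the app does once its self-test passes. */
void app_mark_valid(img_state_t *state) {
    if (*state == IMG_PENDING_VERIFY) {
        *state = IMG_VALID;
    }
}
```

Two calls to bootloader_select() with no app_mark_valid() in between end in IMG_ABORTED, which is the "no second chances" behaviour in miniature.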

Partition Table Design Matters More Than You Think

The built-in OTA partition table spends 1 MB on a factory slot you may never boot and leaves no room for a data partition of your own. For production firmware with OTA, you want explicit control:

# partitions.csv
# Name,   Type, SubType, Offset,   Size,    Flags
nvs,      data, nvs,     0x9000,   0x6000,
otadata,  data, ota,     0xf000,   0x2000,
phy_init, data, phy,     0x11000,  0x1000,
ota_0,    app,  ota_0,   0x20000,  0x1C0000,
ota_1,    app,  ota_1,   0x1E0000, 0x1C0000,
storage,  data, spiffs,  0x3A0000, 0x60000,

The storage partition is where you keep configuration that survives firmware updates — certificates, device-specific calibration, user settings. None of this should be baked into firmware. Store it here, read it at runtime.
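
Hand-edited offsets are easy to get wrong, so it can be worth checking the arithmetic before flashing. The sketch below is a host-side check, assuming a 4 MB flash chip and the 64 KB (0x10000) alignment that app partitions require; part_t, layout_ok, and POST_LAYOUT are hypothetical names, and the entries are copied from the table above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t offset;
    uint32_t size;
    bool is_app; /* app partitions must start on a 64 KB boundary */
} part_t;

#define APP_ALIGN 0x10000u

/* The layout from partitions.csv, sorted by offset. */
const part_t POST_LAYOUT[] = {
    {"nvs",      0x9000,   0x6000,   false},
    {"otadata",  0xF000,   0x2000,   false},
    {"phy_init", 0x11000,  0x1000,   false},
    {"ota_0",    0x20000,  0x1C0000, true},
    {"ota_1",    0x1E0000, 0x1C0000, true},
    {"storage",  0x3A0000, 0x60000,  false},
};
#define POST_LAYOUT_LEN (sizeof POST_LAYOUT / sizeof POST_LAYOUT[0])

/* Entries must be sorted by offset. Returns true when no partition
 * overlaps the previous one, app partitions are 64 KB aligned, and
 * every partition ends inside flash_size. */
bool layout_ok(const part_t *p, size_t n, uint32_t flash_size) {
    uint32_t prev_end = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i].offset < prev_end) return false;
        if (p[i].is_app && (p[i].offset % APP_ALIGN) != 0) return false;
        if (p[i].offset > flash_size) return false;
        if (p[i].size > flash_size - p[i].offset) return false;
        prev_end = p[i].offset + p[i].size;
    }
    return true;
}
```

With a 4 MB flash size (0x400000) the layout passes and storage ends exactly at the end of flash; the same table fails on a 2 MB part because ota_1 no longer fits.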

ESP-IDF 5.x note: SPIFFS still ships with ESP-IDF 5.x, but it sees little upstream activity and can corrupt data on a mid-write power cut. For new production deployments, consider LittleFS (the esp_littlefs component), which is designed for power-loss resilience and is actively maintained. The partition layout above applies equally; swap the filesystem driver and, if your IDF version supports it, the SubType label.

// Read device config from storage partition, not hardcoded
esp_vfs_spiffs_conf_t conf = {
    .base_path = "/spiffs",
    .partition_label = "storage",
    .max_files = 5,
    .format_if_mount_failed = false  // NEVER auto-format in production
};
esp_err_t ret = esp_vfs_spiffs_register(&conf);
if (ret != ESP_OK) {
    ESP_LOGE(TAG, "Failed to mount storage partition: %s", esp_err_to_name(ret));
    // handle error — don't proceed blindly
}

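FILE *f = fopen("/spiffs/device.json", "r");
if (f == NULL) {
    ESP_LOGE(TAG, "device.json missing from storage partition");
    // fall back to safe defaults or refuse to start
}
// Parse JSON config, apply to device, then fclose(f)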

graph TD
    BOOT[Bootloader] --> CHECK{otadata valid?}
    CHECK -->|Yes| ACTIVE[Load active OTA slot]
    CHECK -->|No| OTA0[Load ota_0 default]
    
    ACTIVE --> VALIDATE{App calls mark_valid?}
    VALIDATE -->|Yes| RUNNING[Running Firmware]
    VALIDATE -->|No / Panic before call| ROLLBACK[Rollback to previous slot on next boot]
    
    RUNNING --> OTA_START[OTA Update Triggered]
    OTA_START --> WRITE[Write to inactive slot]
    WRITE --> VERIFY[Verify SHA256 + signature]
    VERIFY -->|Pass| COMMIT[Set new slot active, reboot]
    VERIFY -->|Fail| ABORT[Abort, remain on current]

Fleet Management: You Need a Server

ESPHome is excellent for home automation and hobby use. The moment you have more than ten devices or need audit logs of what firmware version is running where, you outgrow it.

Mender has a self-hosted open-source server, and NervesHub is Elixir-based and excellent if that’s your stack, but both are built primarily around embedded Linux devices. For pure ESP-IDF shops, the simplest production-viable setup is a minimal HTTP server plus MQTT for status reporting.

Here’s the minimal Go server I use for small fleets. The checksum is computed once at startup — not per-request:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
    "strconv"
)

type FirmwareServer struct {
    firmwarePath string
    version      string
    checksum     string
    size         int64
}

func NewFirmwareServer(path, version string) (*FirmwareServer, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, fmt.Errorf("open firmware: %w", err)
    }
    defer f.Close()

    stat, err := f.Stat()
    if err != nil {
        return nil, fmt.Errorf("stat firmware: %w", err)
    }

    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        return nil, fmt.Errorf("hash firmware: %w", err)
    }

    return &FirmwareServer{
        firmwarePath: path,
        version:      version,
        checksum:     hex.EncodeToString(h.Sum(nil)),
        size:         stat.Size(),
    }, nil
}

func (s *FirmwareServer) handleOTA(w http.ResponseWriter, r *http.Request) {
    deviceID := r.Header.Get("X-Device-ID")
    currentVersion := r.Header.Get("X-Current-Version")

    log.Printf("OTA check: device=%s current=%s latest=%s",
        deviceID, currentVersion, s.version)

    if currentVersion == s.version {
        w.WriteHeader(http.StatusNotModified)
        return
    }

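    f, err := os.Open(s.firmwarePath)
    if err != nil {
        http.Error(w, "firmware not found", http.StatusInternalServerError)
        return
    }
    defer f.Close()

    w.Header().Set("Content-Type", "application/octet-stream")
    w.Header().Set("Content-Length", strconv.FormatInt(s.size, 10))
    w.Header().Set("X-Firmware-Version", s.version)
    w.Header().Set("X-Firmware-SHA256", s.checksum)

    // Headers are already sent at this point, so a failed copy can only be logged
    if _, err := io.Copy(w, f); err != nil {
        log.Printf("OTA transfer aborted: device=%s err=%v", deviceID, err)
    }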
}

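func main() {
    // An empty version would equal the missing X-Current-Version header of a
    // misconfigured device and answer it with 304 forever, so refuse to start.
    version := os.Getenv("FIRMWARE_VERSION")
    if version == "" {
        log.Fatal("FIRMWARE_VERSION must be set")
    }

    srv, err := NewFirmwareServer("/opt/firmware/app.bin", version)
    if err != nil {
        log.Fatalf("Failed to load firmware: %v", err)
    }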

    http.HandleFunc("/ota/firmware.bin", srv.handleOTA)
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "ok")
    })

    log.Printf("Firmware server starting, version=%s sha256=%s", srv.version, srv.checksum)
    log.Fatal(http.ListenAndServeTLS(":8443", "cert.pem", "key.pem", nil))
}

On the device side:

static esp_err_t _http_event_handler(esp_http_client_event_t *evt) {
    switch(evt->event_id) {
        case HTTP_EVENT_ON_HEADER:
            if (strcmp(evt->header_key, "X-Firmware-Version") == 0) {
                strncpy(g_ota_version, evt->header_value, sizeof(g_ota_version) - 1);
                g_ota_version[sizeof(g_ota_version) - 1] = '\0';
            }
            break;
        default:
            break;
    }
    return ESP_OK;
}

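// Send the headers the server uses to decide whether an update is needed
static esp_err_t _ota_http_client_init(esp_http_client_handle_t client) {
    esp_http_client_set_header(client, "X-Device-ID", g_device_id);
    return esp_http_client_set_header(client, "X-Current-Version", FIRMWARE_VERSION);
}

void ota_task(void *pvParameter) {
    esp_http_client_config_t config = {
        .url = "https://ota.internal:8443/ota/firmware.bin",
        .cert_pem = (char *)server_cert_pem_start,
        .event_handler = _http_event_handler,
        .keep_alive_enable = true,  // TCP keep-alive: prevents NAT timeout on long downloads
        // Mutual TLS in production
        .client_cert_pem = (char *)client_cert_pem_start,
        .client_key_pem = (char *)client_key_pem_start,
    };
    
    esp_https_ota_config_t ota_config = {
        .http_config = &config,
        .http_client_init_cb = _ota_http_client_init,  // runs before the request is sent
    };
    
    ESP_LOGI(TAG, "Starting OTA update");
    esp_err_t ret = esp_https_ota(&ota_config);
    
    if (ret == ESP_OK) {
        ESP_LOGI(TAG, "OTA success, rebooting");
        esp_restart();
    } else {
        // A 304 (already up to date) carries no image, so expect it to land here too
        ESP_LOGE(TAG, "OTA failed: %s", esp_err_to_name(ret));
    }
    
    vTaskDelete(NULL);
}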
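
The server also sends X-Firmware-SHA256, but the event handler above records only the version. If you want to compare that header against a digest computed on the device, the hex string has to be decoded first. A minimal sketch in plain C; parse_sha256_hex and digest_matches are hypothetical helpers, not ESP-IDF functions, and computing the digest itself is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA256_LEN 32

/* Returns 0..15 for a hex digit, -1 for anything else. */
static int hex_nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decodes exactly 64 hex characters into 32 bytes.
 * Rejects NULL, wrong lengths, and non-hex characters. */
bool parse_sha256_hex(const char *hex, uint8_t out[SHA256_LEN]) {
    if (hex == NULL || strlen(hex) != SHA256_LEN * 2) return false;
    for (size_t i = 0; i < SHA256_LEN; i++) {
        int hi = hex_nibble(hex[2 * i]);
        int lo = hex_nibble(hex[2 * i + 1]);
        if (hi < 0 || lo < 0) return false;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    return true;
}

/* Compares the advertised digest with one computed locally. */
bool digest_matches(const uint8_t a[SHA256_LEN], const uint8_t b[SHA256_LEN]) {
    return memcmp(a, b, SHA256_LEN) == 0;
}
```

In the handler this would sit next to the X-Firmware-Version branch: decode evt->header_value into a 32-byte buffer and keep it for comparison once the download finishes.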

[Figure: ESP32 OTA update flow showing device fleet polling firmware server]

Observability: MQTT + InfluxDB + Grafana

Blind devices are a nightmare. Every production ESP32 should phone home with at minimum: firmware version, uptime, free heap, Wi-Fi RSSI, and the last reset reason.

typedef struct {
    char device_id[32];
    char firmware_version[16];
    uint32_t uptime_seconds;
    uint32_t free_heap;
    int8_t wifi_rssi;
    esp_reset_reason_t last_reset_reason;
    float temperature_c;
} device_telemetry_t;

void publish_telemetry(void) {
    wifi_ap_record_t ap_info = {0};
    if (esp_wifi_sta_get_ap_info(&ap_info) != ESP_OK) {
        ap_info.rssi = 0;  // not associated: report 0 as "no reading"
    }

    device_telemetry_t telemetry = {
        .uptime_seconds = esp_timer_get_time() / 1000000,
        .free_heap = esp_get_free_heap_size(),
        .wifi_rssi = ap_info.rssi,
        .last_reset_reason = esp_reset_reason(),
    };
    
    strncpy(telemetry.device_id, g_device_id, sizeof(telemetry.device_id) - 1);
    telemetry.device_id[sizeof(telemetry.device_id) - 1] = '\0';
    
    strncpy(telemetry.firmware_version, FIRMWARE_VERSION,
            sizeof(telemetry.firmware_version) - 1);
    telemetry.firmware_version[sizeof(telemetry.firmware_version) - 1] = '\0';
    
    char json[512];
    snprintf(json, sizeof(json),
        "{\"device_id\":\"%s\",\"firmware\":\"%s\","
        "\"uptime\":%" PRIu32 ",\"free_heap\":%" PRIu32 ","
        "\"rssi\":%d,\"reset_reason\":%d}",
        telemetry.device_id,
        telemetry.firmware_version,
        telemetry.uptime_seconds,
        telemetry.free_heap,
        telemetry.wifi_rssi,
        telemetry.last_reset_reason);
    
    int msg_id = esp_mqtt_client_publish(mqtt_client,
        "devices/telemetry", json, 0, 1, 0);
    if (msg_id < 0) {
        ESP_LOGW(TAG, "Telemetry publish failed");
    }
}

Wire this to a Telegraf MQTT consumer → InfluxDB → Grafana and you get fleet-wide visibility. Alert on free_heap < 50000 (a leak or fragmentation eating the heap), reset_reason == 4 (ESP_RST_PANIC; 3 is ESP_RST_SW, an ordinary software restart), or devices that haven’t checked in for 10 minutes.
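
In practice these rules live in Grafana, but writing them out as code pins down exactly what each one means. The sketch below is a hypothetical host-side evaluator: fleet_sample_t, evaluate_alerts, and the ALERT_ flags are made-up names, and the reset-reason constant mirrors the numeric value of ESP_RST_PANIC in ESP-IDF's esp_reset_reason_t, which you should confirm against esp_system.h for the IDF version you ship.

```c
#include <stdint.h>

/* Assumed to match ESP_RST_PANIC in esp_reset_reason_t. */
#define RESET_REASON_PANIC 4

#define LOW_HEAP_BYTES   50000u /* threshold from the text above */
#define SILENT_SECONDS   600u   /* 10 minutes without a report */

enum {
    ALERT_NONE     = 0,
    ALERT_LOW_HEAP = 1 << 0,
    ALERT_PANIC    = 1 << 1,
    ALERT_SILENT   = 1 << 2
};

/* One device's latest telemetry, as seen by the monitoring side. */
typedef struct {
    uint32_t free_heap;
    int reset_reason;
    uint32_t seconds_since_report;
} fleet_sample_t;

/* Returns a bitmask of every alert rule the sample trips. */
unsigned evaluate_alerts(const fleet_sample_t *s) {
    unsigned flags = ALERT_NONE;
    if (s->free_heap < LOW_HEAP_BYTES) flags |= ALERT_LOW_HEAP;
    if (s->reset_reason == RESET_REASON_PANIC) flags |= ALERT_PANIC;
    if (s->seconds_since_report > SILENT_SECONDS) flags |= ALERT_SILENT;
    return flags;
}
```

A bitmask keeps the rules independent: a device that panicked and is also low on heap raises both flags instead of whichever rule happened to be checked first.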

flowchart LR
    subgraph Fleet["ESP32 Fleet"]
        D1[Device 01]
        D2[Device 02]
        DN[Device N...]
    end
    
    subgraph Infra["Infrastructure"]
        MQTT[Mosquitto MQTT Broker]
        TEL[Telegraf Consumer]
        INFLUX[(InfluxDB)]
        GRAFANA[Grafana Dashboard]
        OTA_SRV[OTA HTTP Server]
        ALERT[Alertmanager]
    end
    
    D1 & D2 & DN -->|telemetry/status| MQTT
    MQTT --> TEL --> INFLUX --> GRAFANA
    GRAFANA --> ALERT
    D1 & D2 & DN -->|poll for updates| OTA_SRV

The Benchmarks That Actually Matter

Metric	ESP32 (LX6, 240MHz)	ESP32-S3 (LX7, 240MHz)
SHA256 1KB (software)	~2.1ms	~1.8ms
SHA256 1KB (hardware accel)	~0.3ms	~0.2ms
TLS handshake (ECDHE-RSA)	~2.8s	~2.1s
MQTT publish (QoS 1)	~15ms	~12ms
OTA 1MB firmware	~45s @ Wi-Fi	~38s @ Wi-Fi
Free heap (fresh boot)	~250KB	~320KB
Stack per FreeRTOS task	2–8KB typical	2–8KB typical

The ESP32-S3 uses Xtensa LX7 cores versus the original ESP32’s LX6. Both run at 240MHz here, so the gains you see across the board come from the core upgrade, not from a higher clock.

TLS handshake time is the killer for battery-powered devices. For devices that sleep between transmissions, keep_alive_enable won’t help you — that’s TCP keep-alive, which only prevents idle connection teardown. It has no effect when the device sleeps and disconnects entirely. The real options for cutting handshake overhead are TLS session tickets (requires server-side support; mbedTLS handles this) or pre-shared keys with a faster cipher suite like TLS_PSK_WITH_AES_128_CCM_8. The PSK approach trades key distribution complexity for a handshake measured in milliseconds instead of seconds.
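
A quick calculation shows how lopsided this is. The sketch below assumes one handshake and one publish per wake cycle and ignores Wi-Fi association time; handshake_share is a hypothetical helper, and the inputs in the note afterwards are the ESP32 column of the table above.

```c
/* Fraction of a wake cycle's network time spent in the TLS handshake.
 * Both arguments are in seconds. Returns 0 when there is no network time. */
double handshake_share(double handshake_s, double publish_s) {
    double total = handshake_s + publish_s;
    if (total <= 0.0) {
        return 0.0;
    }
    return handshake_s / total;
}
```

With a handshake of about 2.8 s and a QoS 1 publish of about 15 ms, handshake_share(2.8, 0.015) comes out above 0.99: nearly all of the radio-on time goes to setting up a session that carries one small message.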

When to Use It, When to Reach for Something Else

Use ESP32 when:

You need Wi-Fi + compute at under $5/unit
Battery life isn’t critical (or you’re on mains power)
Your fleet is under ~1000 units and you control the network
You’re comfortable with C/C++ or MicroPython

Reach for something else when:

You need cellular connectivity — look at the ESP32 with a SIM7670 module, or switch to a Quectel module with a proper cellular stack
Sub-GHz range matters — the nRF52840 + LoRa combo covers 2km+ where Wi-Fi gives up
You need hard real-time guarantees: FreeRTOS ticks at 100Hz by default, and while the watchdog is your friend, you’re not running a motor controller on this thing
Your update pipeline needs cryptographic firmware signing with an HSM — doable on ESP32 (secure boot v2 + flash encryption), but the key ceremony is painful and irreversible if you make mistakes

The One Config You Must Get Right Before Shipping

Enable secure boot v2 and flash encryption before your first production flash. Once devices are in the field, you cannot retroactively enable these. The public key digest is burned into eFuses on first boot — the private key never touches the device. Keep your private signing key in a secrets manager or HSM. If it’s compromised, every device trusting that key is compromised.

The threat model is precise: eFuse holds the public key digest → bootloader rejects any binary not signed by the matching private key → an attacker with physical flash access cannot install unsigned firmware.

The following is a reference configuration fragment, not a file ESP-IDF reads as-is. Apply these via sdkconfig.defaults or a Kconfig fragment in your build system:

# Reference: production sdkconfig settings
CONFIG_SECURE_BOOT=y
CONFIG_SECURE_BOOT_V2_ENABLED=y
# ESP32 (chip revision v3.0+) and ESP32-S3 sign Secure Boot v2 images with RSA
CONFIG_SECURE_SIGNED_APPS_RSA_SCHEME=y
CONFIG_SECURE_BOOT_SIGNING_KEY="secure_boot_signing_key.pem"

CONFIG_SECURE_FLASH_ENC_ENABLED=y
CONFIG_SECURE_FLASH_ENCRYPTION_MODE_RELEASE=y

CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y
CONFIG_BOOTLOADER_APP_ANTI_ROLLBACK=y
CONFIG_BOOTLOADER_APP_SECURE_VERSION=1

# Watchdog: if a watched task (here the CPU0 idle task) is starved for 30s, panic and reset
CONFIG_ESP_TASK_WDT_TIMEOUT_S=30
CONFIG_ESP_TASK_WDT_PANIC=y
CONFIG_ESP_TASK_WDT_CHECK_IDLE_TASK_CPU0=y
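
The anti-rollback pair (CONFIG_BOOTLOADER_APP_ANTI_ROLLBACK and CONFIG_BOOTLOADER_APP_SECURE_VERSION) is the setting in this fragment that is easiest to misread. A minimal model of the rule, assuming the bootloader refuses any image whose secure version is lower than the one recorded in eFuse and that the recorded value is raised once a newer image is confirmed valid; both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* efuse_version: the secure version the chip has recorded.
 * image_version: the secure version compiled into a firmware image. */
bool image_allowed(uint32_t efuse_version, uint32_t image_version) {
    return image_version >= efuse_version;
}

/* After an image is confirmed valid, the recorded version only moves up. */
uint32_t efuse_after_confirm(uint32_t efuse_version, uint32_t image_version) {
    return image_version > efuse_version ? image_version : efuse_version;
}
```

The consequence for releases: bump the secure version only when you need to lock out older builds (for example, after a security fix), because once a device confirms the newer image, everything below that version stops booting on it for good.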

Generate your signing key once, store it in a secrets manager, and never let it touch a developer’s laptop. Every firmware binary gets signed before it hits your OTA server. The bootloader verifies the signature before loading.

[Figure: ESP32 secure boot chain showing key hierarchy and verification steps]

The ESP32 ecosystem has matured considerably. The tooling (ESP-IDF 5.x, the VS Code extension, idf.py workflows) is genuinely good. The documentation is dense but complete. What it lacks is honest writing about production failure modes — the devices that got bricked by a bad OTA, the fleet that came online with debug logging enabled and hammered the broker, the secure boot key that got lost.

Design for failures you haven’t seen yet. That’s what separates a prototype from infrastructure.