Intermediate

IoT Sensor Data for Machine Learning

IoT sensor data is the fuel for AIoT. Learning to collect, clean, and engineer features from noisy, heterogeneous sensor streams is one of the most important skills in AIoT development.

Common IoT Sensor Types

Sensor         | Data Type           | Sample Rate | ML Applications
---------------|---------------------|-------------|------------------------------------------
Accelerometer  | 3-axis acceleration | 50-1000 Hz  | Activity recognition, vibration analysis
Temperature    | Scalar (°C)         | 0.1-1 Hz    | Anomaly detection, HVAC optimization
Camera         | Image/video         | 1-30 fps    | Object detection, quality inspection
Microphone     | Audio waveform      | 8-44.1 kHz  | Keyword spotting, machine health
GPS            | Lat/lon/alt         | 1-10 Hz     | Fleet tracking, geofencing
Current sensor | Amperage            | 1-100 Hz    | Energy monitoring, fault detection

Data Collection Pipeline

  1. Sensor Reading

    Read raw values from sensors via I2C, SPI, UART, or analog pins. Use hardware timers for consistent sample rates.

  2. Local Buffering

    Buffer readings in memory or local storage. IoT devices may lose connectivity, so local buffering prevents data loss.

  3. Transmission

    Send data via MQTT (lightweight pub/sub), HTTP, or CoAP. Use message queuing to handle intermittent connectivity.

  4. Ingestion

    Cloud ingestion via AWS IoT Core, Azure IoT Hub, or Apache Kafka. Tag data with device ID, timestamps, and metadata.

  5. Storage

    Time-series databases (InfluxDB, TimescaleDB) for sensor data. Object storage (S3) for images and audio.
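
The buffering and transmission steps above can be sketched as a small store-and-forward buffer. This is a minimal illustration, not a production design: the class name `StoreAndForwardBuffer` and the `publish` callable are hypothetical, and `publish` stands in for whatever transport you use (for example a thin wrapper around an MQTT client) that returns True on success.

```python
import json
import time
from collections import deque


class StoreAndForwardBuffer:
    """Bounded in-memory buffer that holds readings until they are sent."""

    def __init__(self, device_id, max_readings=1000):
        self.device_id = device_id
        # A deque with maxlen silently drops the oldest reading when full
        self.pending = deque(maxlen=max_readings)

    def add(self, sensor, value, timestamp=None):
        """Tag a reading with device ID and timestamp, then buffer it."""
        self.pending.append({
            "device_id": self.device_id,
            "sensor": sensor,
            "value": value,
            "timestamp": time.time() if timestamp is None else timestamp,
        })

    def flush(self, publish):
        """Send buffered readings oldest-first and stop at the first failure.

        `publish` (hypothetical) takes a JSON string and returns True on success.
        """
        sent = 0
        while self.pending:
            payload = json.dumps(self.pending[0])
            if not publish(payload):
                break  # link is down: keep the reading for the next attempt
            self.pending.popleft()
            sent += 1
        return sent
```

A reading is removed from the buffer only after `publish` reports success, so a dropped connection delays data rather than losing it. The bounded size is a trade-off: on a memory-constrained device, discarding the oldest readings is usually preferable to running out of RAM.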

Preprocessing IoT Data

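import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load sensor data and index it by time.
# interpolate(method="time") requires a DatetimeIndex.
df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# Handle missing values (common in IoT), weighting by the actual time gaps
df["temperature"] = df["temperature"].interpolate(method="time")

# Remove outliers (sensor glitches) by keeping the 1st-99th percentile range.
# Note: this always discards about 2% of rows, and rows with a missing
# vibration value are dropped as well.
q1 = df["vibration"].quantile(0.01)
q99 = df["vibration"].quantile(0.99)
df = df[(df["vibration"] >= q1) & (df["vibration"] <= q99)]

# Resample to uniform 1-second intervals (sensors may have irregular timing).
# Assumes all remaining columns are numeric.
df = df.resample("1s").mean().interpolate()

# Normalize to zero mean and unit variance.
# For model training, fit the scaler on the training split only.
scaler = StandardScaler()
df[["temperature", "vibration", "pressure"]] = scaler.fit_transform(
    df[["temperature", "vibration", "pressure"]]
)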
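
To make the resampling step concrete, here is a small worked example on made-up readings with jittered arrival times. The values and timestamps are invented for illustration only.

```python
import pandas as pd

# Hypothetical temperature readings with irregular timing and gaps
raw = pd.DataFrame(
    {"temperature": [20.0, 20.4, 21.0, 22.0]},
    index=pd.to_datetime([
        "2024-01-01 00:00:00.2",
        "2024-01-01 00:00:00.8",
        "2024-01-01 00:00:02.1",
        "2024-01-01 00:00:04.5",
    ]),
)

# Average all readings that fall into each 1-second bin.
# Bins with no readings (00:00:01 and 00:00:03) become NaN.
uniform = raw.resample("1s").mean()

# Fill the empty bins by linear interpolation between neighbors
filled = uniform.interpolate()
```

The two readings in the first second are averaged into one value, and the two empty bins are filled from their neighbors. For long outages, consider passing `limit=` to `interpolate` so that large gaps stay missing instead of being bridged with invented values.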

Feature Engineering for Sensor Data

  • Rolling statistics: Mean, std, min, max over sliding windows (5s, 30s, 5min). Captures trends and variability.
  • Frequency domain: FFT to extract dominant frequencies from vibration and audio data. Critical for machine health.
  • Cross-sensor features: Ratios and correlations between sensors (e.g., temperature/pressure ratio).
  • Time-based features: Hour of day, day of week, time since last event. Captures cyclical patterns.
  • Lag features: Previous values (t-1, t-2, ..., t-n) for autoregressive models.
  • Delta features: Rate of change (first derivative) and acceleration (second derivative).
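
Several of the features listed above can be sketched with pandas and NumPy. The function names, column names, and window sizes below are illustrative assumptions, and the code assumes a DataFrame with a sorted, uniformly spaced DatetimeIndex.

```python
import numpy as np
import pandas as pd


def add_time_domain_features(df, column, window="5s", n_lags=2):
    """Add rolling, lag, delta, and calendar features for one sensor column."""
    out = df.copy()

    # Rolling statistics over a trailing time window
    rolling = out[column].rolling(window)
    out[f"{column}_roll_mean"] = rolling.mean()
    out[f"{column}_roll_std"] = rolling.std()

    # Lag features: previous values t-1, ..., t-n
    for lag in range(1, n_lags + 1):
        out[f"{column}_lag{lag}"] = out[column].shift(lag)

    # Delta features: per-sample first and second differences
    out[f"{column}_delta"] = out[column].diff()
    out[f"{column}_delta2"] = out[column].diff().diff()

    # Time-based features taken from the index
    out["hour"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek
    return out


def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC spectral component."""
    signal = np.asarray(signal, dtype=float)
    # Subtract the mean so the DC offset does not dominate the spectrum
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate_hz)
    return freqs[np.argmax(spectrum)]
```

The first rows of the lag and delta columns are NaN because there is no earlier sample to compare against, so drop or impute them before training. The differences are per sample; divide by the sampling interval if you need a rate in physical units.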

Data Quality Challenges

  • Sensor drift: Calibration changes over time, so sensors need periodic recalibration or drift compensation algorithms.
  • Missing data: Network drops, battery depletion, and sensor failures create gaps. Fill them with interpolation or forward-fill, depending on context.
  • Clock synchronization: Different devices may have slightly different clocks. Use NTP or GPS timestamps.
  • Noise: Electrical interference, environmental vibration, and quantization noise. Use filtering (moving average, Kalman filter).
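
The two filters named in the noise bullet can be sketched as follows. The Kalman filter here is a scalar version that assumes a slowly varying quantity (a random-walk model); the function names and the default variances are illustrative assumptions that would need tuning for a real sensor.

```python
import numpy as np


def moving_average(values, window=5):
    """Smooth a 1-D signal by averaging each run of `window` samples."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    # "valid" keeps only positions where the full window fits,
    # so the output has len(values) - window + 1 samples
    return np.convolve(values, kernel, mode="valid")


def kalman_filter_1d(values, process_var=1e-4, measurement_var=0.1):
    """Scalar Kalman filter for a slowly varying quantity."""
    values = np.asarray(values, dtype=float)
    estimate = values[0]         # start from the first measurement
    error_var = measurement_var  # initial uncertainty of the estimate
    filtered = np.empty_like(values)
    for i, measurement in enumerate(values):
        # Predict: state is unchanged, uncertainty grows by the process variance
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (measurement - estimate)
        error_var *= 1.0 - gain
        filtered[i] = estimate
    return filtered
```

A larger `measurement_var` relative to `process_var` tells the filter to trust new readings less, which gives smoother output but slower response to real changes. The moving average is simpler and has no model to tune, at the cost of a fixed lag of roughly half the window.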

Key takeaway: IoT sensor data requires careful preprocessing. Handle missing values, remove outliers, resample to uniform intervals, and engineer time-domain and frequency-domain features. Data quality is often the biggest determinant of ML model performance in AIoT systems.