Intermediate
IoT Sensor Data for Machine Learning
IoT sensor data is the fuel for AIoT. Learning to collect, clean, and engineer features from noisy, heterogeneous sensor streams is a core skill in AIoT development.
Common IoT Sensor Types
| Sensor | Data Type | Sample Rate | ML Applications |
|---|---|---|---|
| Accelerometer | 3-axis acceleration | 50-1000 Hz | Activity recognition, vibration analysis |
| Temperature | Scalar (°C) | 0.1-1 Hz | Anomaly detection, HVAC optimization |
| Camera | Image/video | 1-30 fps | Object detection, quality inspection |
| Microphone | Audio waveform | 8-44.1 kHz | Keyword spotting, machine health |
| GPS | Lat/lon/alt | 1-10 Hz | Fleet tracking, geofencing |
| Current sensor | Amperage | 1-100 Hz | Energy monitoring, fault detection |
Data Collection Pipeline
Sensor Reading
Read raw values from sensors via I2C, SPI, UART, or analog pins. Use hardware timers for consistent sample rates.
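On a microcontroller, a hardware timer would normally drive sampling. On a Linux-class device running Python, a rough equivalent is to schedule each read against a monotonic clock so that small delays do not accumulate. The sketch below assumes this setup; `read_sensor` and `fake_sensor` are hypothetical placeholders for a real I2C, SPI, UART, or ADC read.

```python
import time
from typing import Callable, List, Tuple


def sample_at_fixed_rate(
    read_sensor: Callable[[], float],
    rate_hz: float,
    n_samples: int,
) -> List[Tuple[float, float]]:
    """Collect n_samples readings at roughly rate_hz as (elapsed_s, value) pairs."""
    period = 1.0 / rate_hz
    start = time.monotonic()
    samples = []
    for i in range(n_samples):
        # Schedule against the start time so per-iteration delays do not accumulate.
        target = start + i * period
        delay = target - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        samples.append((time.monotonic() - start, read_sensor()))
    return samples


def fake_sensor() -> float:
    # Hypothetical stand-in for a real sensor read.
    return 21.5


readings = sample_at_fixed_rate(fake_sensor, rate_hz=100, n_samples=20)
```

Software timing like this is only approximate, since the operating system can delay a wake-up. High-rate signals such as vibration or audio are usually sampled by hardware and read in blocks.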
Local Buffering
Buffer readings in memory or local storage. IoT devices may lose connectivity, so local buffering prevents data loss.
Transmission
Send data via MQTT (lightweight pub/sub), HTTP, or CoAP. Use message queuing to handle intermittent connectivity.
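The buffering and transmission steps above can be combined into a store-and-forward queue. The following is a minimal sketch, not a production design: `StoreAndForwardBuffer` and `fake_send` are hypothetical names, and `fake_send` stands in for a real MQTT, HTTP, or CoAP publish call that reports success or failure.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForwardBuffer:
    """Bounded in-memory queue that retries sending once the link is back."""

    def __init__(self, send: Callable[[dict], bool], max_items: int = 1000):
        self._send = send
        # When full, deque(maxlen=...) silently drops the oldest reading.
        self._queue: Deque[dict] = deque(maxlen=max_items)

    def add(self, reading: dict) -> None:
        self._queue.append(reading)

    def flush(self) -> int:
        """Send queued readings oldest-first and stop at the first failure."""
        sent = 0
        while self._queue:
            if not self._send(self._queue[0]):
                break  # link is down; keep the reading for the next attempt
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._queue)


# Simulated link state and delivery log (illustrative only).
link = {"up": False}
delivered = []


def fake_send(reading: dict) -> bool:
    # Hypothetical stand-in for a publish call; succeeds only when the link is up.
    if link["up"]:
        delivered.append(reading)
    return link["up"]


buf = StoreAndForwardBuffer(fake_send, max_items=3)
for i in range(5):
    buf.add({"seq": i})

sent_while_down = buf.flush()       # link down: nothing leaves the buffer
link["up"] = True
sent_after_reconnect = buf.flush()  # link restored: the retained readings go out
```

The bounded queue trades completeness for predictable memory use: with `max_items=3`, the two oldest readings are discarded during the outage. A device that cannot afford that loss would need to spill to local storage instead.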
Ingestion
Ingest data through a managed cloud service such as AWS IoT Core or Azure IoT Hub, or through a streaming platform such as Apache Kafka. Tag each record with its device ID, timestamp, and metadata.
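One way to tag readings before they reach the ingestion layer is to wrap each value in a small JSON envelope. The field names below (`device_id`, `sensor`, `metadata`, `schema_version`) and the device name `pump-07` are assumptions chosen for illustration, not a schema required by any of the services named above.

```python
import json
from datetime import datetime, timezone


def build_message(device_id: str, sensor: str, value: float, unit: str) -> str:
    """Serialize one reading as JSON with device ID, UTC timestamp, and metadata."""
    payload = {
        "device_id": device_id,
        # Timezone-aware UTC timestamps avoid ambiguity across devices and regions.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sensor": sensor,
        "value": value,
        "metadata": {"unit": unit, "schema_version": 1},
    }
    return json.dumps(payload)


message = build_message("pump-07", "temperature", 71.3, "degC")
decoded = json.loads(message)
```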
Storage
Store numeric sensor readings in a time-series database (InfluxDB, TimescaleDB) and images and audio in object storage (S3).
Preprocessing IoT Data
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load sensor data and index it by time (required for time-based interpolation and resampling)
df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()
# Handle missing values (common in IoT); method="time" needs a DatetimeIndex
df["temperature"] = df["temperature"].interpolate(method="time")
# Remove outliers (sensor glitches) by trimming values outside the 1st-99th percentile range
q1 = df["vibration"].quantile(0.01)
q99 = df["vibration"].quantile(0.99)
df = df[df["vibration"].between(q1, q99)]
# Resample to uniform 1-second intervals (sensors may have irregular timing)
df = df.resample("1s").mean(numeric_only=True).interpolate()
# Normalize each feature to zero mean and unit variance
cols = ["temperature", "vibration", "pressure"]
df[cols] = StandardScaler().fit_transform(df[cols])
Feature Engineering for Sensor Data
- Rolling statistics: Mean, std, min, max over sliding windows (5s, 30s, 5min). Captures trends and variability.
- Frequency domain: FFT to extract dominant frequencies from vibration and audio data. Critical for machine health.
- Cross-sensor features: Ratios and correlations between sensors (e.g., temperature/pressure ratio).
- Time-based features: Hour of day, day of week, time since last event. Captures cyclical patterns.
- Lag features: Previous values (t-1, t-2, ..., t-n) for autoregressive models.
- Delta features: Rate of change (first derivative) and acceleration (second derivative).
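Several of the features listed above can be sketched with pandas and NumPy. The example uses a synthetic 100 Hz vibration signal with a 12 Hz tone, so the column names, window length, and signal parameters are all illustrative assumptions. Cross-sensor features are omitted because the synthetic data has a single channel.

```python
import numpy as np
import pandas as pd

# Synthetic vibration signal: a 12 Hz tone plus a little noise (illustrative only).
fs = 100  # sampling rate in Hz
t = np.arange(1000) / fs
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 12 * t) + 0.1 * rng.standard_normal(t.size)

index = pd.date_range("2024-01-01", periods=t.size, freq="10ms")
df = pd.DataFrame({"vibration": signal}, index=index)

# Rolling statistics over a 5-second time window.
window = df["vibration"].rolling("5s")
df["vib_mean_5s"] = window.mean()
df["vib_std_5s"] = window.std()

# Lag features: the value one and two samples earlier.
df["vib_lag1"] = df["vibration"].shift(1)
df["vib_lag2"] = df["vibration"].shift(2)

# Delta features: first and second differences scaled to per-second units.
df["vib_delta"] = df["vibration"].diff() * fs
df["vib_delta2"] = df["vib_delta"].diff() * fs

# Time-based features taken from the timestamp index.
df["hour"] = df.index.hour
df["day_of_week"] = df.index.dayofweek


def dominant_frequency(x: np.ndarray, fs: float) -> float:
    """Return the frequency (Hz) of the largest peak in the mean-removed spectrum."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    return float(freqs[np.argmax(spectrum)])


peak_hz = dominant_frequency(df["vibration"].to_numpy(), fs)
```

Lag and delta columns start with missing values, because the first rows have no earlier sample to compare against. Those rows are usually dropped or masked before training.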
Data Quality Challenges
- Sensor drift: Calibration changes over time. Periodic recalibration or drift compensation algorithms are needed.
- Missing data: Network drops, battery depletion, and sensor failures create gaps. Fill them with interpolation or forward-fill, depending on context.
- Clock synchronization: Different devices may have slightly different clocks. Use NTP or GPS timestamps.
- Noise: Electrical interference, environmental vibration, and quantization noise. Use filtering (moving average, Kalman filter).
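The two filters named in the noise item can be sketched in plain Python, which also suits constrained edge devices. This is a simplified illustration: the Kalman filter assumes a scalar, slowly varying quantity (a random-walk state model), and the variance parameters and sample readings are made-up values that would need tuning for a real sensor.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int) -> List[float]:
    """Causal moving average: each output averages the last `window` inputs seen so far."""
    recent = deque(maxlen=window)
    out = []
    for v in values:
        recent.append(v)
        out.append(sum(recent) / len(recent))
    return out


def kalman_1d(values: Iterable[float], process_var: float, meas_var: float) -> List[float]:
    """Scalar Kalman filter for a slowly varying quantity (random-walk state model)."""
    estimate = None
    error_var = 1.0
    out = []
    for z in values:
        if estimate is None:
            estimate = z  # initialize from the first measurement
        else:
            error_var += process_var                   # predict: uncertainty grows
            gain = error_var / (error_var + meas_var)  # update: weigh the new measurement
            estimate += gain * (z - estimate)
            error_var *= 1 - gain
        out.append(estimate)
    return out


noisy = [20.0, 20.4, 19.7, 20.1, 25.0, 20.2, 19.9, 20.3]  # 25.0 is a glitch
smoothed = moving_average(noisy, window=3)
filtered = kalman_1d(noisy, process_var=1e-3, meas_var=0.5)
```

Both filters dampen the glitch rather than remove it, and both lag behind genuine changes. A larger `window` or `meas_var` smooths more at the cost of slower response.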
Key takeaway: IoT sensor data needs careful preprocessing. Handle missing values, remove outliers, resample to uniform intervals, and engineer time-domain and frequency-domain features. Data quality is often the biggest determinant of ML model performance in AIoT systems.
Lilly Tech Systems