Your own Cloud-IoT DIY project. Part 4: Cloud backend.

Daniil Sokolov
16 min read · Aug 20, 2023


This is the fourth part of “Your own Cloud-IoT DIY project”. The previous part can be found here.

TL;DR: AWS is the cloud provider of choice. The infrastructure is fully coded with AWS CDK. The solution is 100% serverless and follows the six pillars of the AWS Well-Architected Framework (a brief design review is included below). Cognito is the Identity Provider of choice.

General principles

For simplicity and cost effectiveness of the cloud backend (DIY!) we’ll use only serverless AWS resources and rely on multiple AWS services, including:

  • AWS IoT Device Management for device registry and monitoring (for an enterprise system this is typically not an option, as multi-tenant access with complicated RBAC is required)
  • REST APIs to serve the UI and devices, using MTLS (mutual TLS).
  • AWS Cognito to serve as our OAuth 2.0 IDP (Identity Provider) and host our login page (for an enterprise system you’ll typically need a more complicated solution).

The good thing though is that all of the infrastructure is described with CDK, so you don’t need to configure anything manually! Instead, everything will be deployed and configured with just one (literally) command. Even if you’re not familiar with cloud stuff, you can still deploy and use it.

Let’s briefly review the components of this cloud design.

Cloud design components

AWS IoT Core

Let’s first clarify the confusion about “Data Plane”:

  • AWS has different “planes” used to support more complicated functionality (Greengrass, device shadows, etc.). The plane which is NOT used by AWS additional services is the “AWS Data Plane”. This is just the basic MQTT connection.
  • We are using the “AWS Data Plane” as a transport (MQTT) for our “application” protocols, including the “Control Plane”, “Status Plane” and “Data Plane”. Don’t be confused: in our project “Data Plane” is part of our application layer, while in AWS documentation “Data Plane” may be used to describe the “MQTT transport” functionality.

We’re using multiple IoT Core capabilities, including the MQTT broker, device provisioning, device registry, message flow control, etc.

Device X509 certificates

We’ll be using the AWS IoT Core service to generate thing X509 certificates, as this offloads a lot of tasks to the cloud provider and is more secure for DIY.
This is typically not an option for an enterprise-level solution, where you’d better own and control your own certificate chain with the ability to invalidate branches of that chain. Here is a good white paper about handling certificates with respect to the manufacturing supply chain and provisioning.

IoT Rules

Using an IoT Rule with actions can be a really good alternative to simply executing a Lambda Function for every raw message that arrives.

The S3 Bucket action writes the MQTT data (potentially augmented by the Rule) directly to an S3 Bucket. This action opens the opportunity to ingest data without spending money on Lambda executions. In our project we’re using this action as the main approach for telemetry ingestion and status logging.

The SQS Queue action writes the data directly to a queue, which enables batch handling. I.e., instead of invoking a Lambda for every message received, you can handle a batch of ingested messages. We do not use this action for now, but it can be an option if the number of status messages becomes high.

There are other actions available and some of them can be extremely helpful for any IoT application.

We’re also using “Basic Ingest” to reduce the cost of our messaging.

All IoT Rules are coded as part of the infrastructure code (available in the Cloud_IoT_DIY_cloud section of the repo) with the help of the AWS CDK IoT actions alpha module (aws_iot_alpha). Using an alpha version is not suitable for enterprise systems, but hopefully this functionality will be included in an upcoming GA release of AWS CDK.
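For illustration, a minimal sketch of such a rule built with the alpha modules could look like this (the topic, rule name and bucket are placeholders, not the exact values from the repo):

```python
import aws_cdk.aws_iot_alpha as iot
import aws_cdk.aws_iot_actions_alpha as iot_actions
from aws_cdk import aws_s3 as s3

# Bucket receiving raw telemetry messages (construct id is illustrative)
telemetry_bucket = s3.Bucket(self, "TelemetryBucket")

# Rule that writes every matching MQTT message directly to S3, no Lambda involved.
# With Basic Ingest a device can publish to $aws/rules/telemetry_to_s3/... to
# trigger this rule directly and avoid standard messaging charges.
iot.TopicRule(
    self, "TelemetryToS3",
    topic_rule_name="telemetry_to_s3",
    sql=iot.IotSql.from_string_as_ver20160323("SELECT * FROM 'dt/telemetry/#'"),
    actions=[
        iot_actions.S3PutObjectAction(
            telemetry_bucket,
            key="${topic()}/${timestamp()}.json",
        )
    ],
)
```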

Storage

Infrastructure code for persistent storage is not complicated in general, but it has some important nuances:

  • resource behavior when the Stack is destroyed. Obviously we don’t want to lose collected data due to an accidental stack destroy. But at the same time we don’t want to manually destroy all persistent storage resources when the infrastructure is in an active development stage. To cover this, the RemovalPolicy of our resources depends on the is_dev_stack parameter.
  • resource behavior when core parameters are changed. In brief, the issue is that when you change some core parameter of a resource (like the partition key of a DynamoDB Table), the only way to enforce it is to replace your current resource with a new, updated one. This is potentially an even more dangerous feature of ‘infrastructure as code’ than an accidental stack destroy, as you cannot be sure which parameters will be treated as core. The good thing though is that this ‘feature’ can be effectively blocked with the useful-by-itself approach of ‘custom naming’ your resources. I.e., AWS will not replace your resources if, instead of the ‘recommended’ resource name autogeneration by AWS CDK, you use your own explicit names. The only thing you should keep in mind: resources must have unique names, and some resources (like S3 Buckets) must have globally unique names. Name uniqueness can be achieved with different approaches (see details in our infrastructure code, available in the Cloud_IoT_DIY_cloud section of the repo in the cloud_iot_diy_cloud_stack.py stack definition). Both points are illustrated in the sketch right after this list.
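Here is a minimal sketch of both points, assuming is_dev_stack and construct_id are available inside the stack (the naming scheme itself is illustrative):

```python
from aws_cdk import RemovalPolicy
from aws_cdk import aws_s3 as s3

# Keep data in 'real' deployments, allow clean teardown while developing
removal_policy = RemovalPolicy.DESTROY if is_dev_stack else RemovalPolicy.RETAIN

historical_bucket = s3.Bucket(
    self, "HistoricalDataBucket",
    # Explicit (custom) name blocks silent resource replacement; the account id
    # suffix is one possible way to keep the bucket name globally unique
    bucket_name=f"{construct_id.lower()}-historical-{self.account}",
    removal_policy=removal_policy,
    # Only dev stacks clean up objects so the bucket can actually be destroyed
    auto_delete_objects=is_dev_stack,
)
```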

Data storage

S3 will be the main data storage in our system, and we’ll be using the “S3 Standard” storage class for now. In general, switching to “S3 Intelligent-Tiering” can make sense; however, for now we’d rather have a more predictable cost (even if it’s slightly more expensive). We will, however, use “Data Lifecycle” rules to move/remove data to avoid high bills for storing “orphan” data.

All S3 Buckets are private (no public access at all) and encrypted. Data in these buckets can be accessed/updated via the APIs or IoT Core.

Each S3 Bucket will have its own Lifecycle Management Rules:

  • Telemetry Storage Bucket — all items expire in 3 days.
    The reason is that items are aggregated into historical data daily, so 3 days gives us an interval to react to a potential failure of the aggregation process.
  • Historical Data Storage Bucket — data never expires but is moved to
    INFREQUENT_ACCESS after two years.
  • Internal Service Storage Bucket — versioned bucket with 5 versions and 30-day retention.
  • Static Web-site Bucket — data never expires and the bucket is not versioned.
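Expressed in CDK, the first two rules above look roughly like this (construct ids are illustrative; the exact values live in the stack definition):

```python
from aws_cdk import Duration
from aws_cdk import aws_s3 as s3

# Telemetry bucket: raw items are only needed until the daily aggregation runs
telemetry_bucket = s3.Bucket(
    self, "TelemetryBucket",
    lifecycle_rules=[s3.LifecycleRule(expiration=Duration.days(3))],
)

# Historical data bucket: keep forever, but move to a cheaper storage class
historical_bucket = s3.Bucket(
    self, "HistoricalDataBucket",
    lifecycle_rules=[
        s3.LifecycleRule(
            transitions=[
                s3.Transition(
                    storage_class=s3.StorageClass.INFREQUENT_ACCESS,
                    transition_after=Duration.days(730),  # roughly two years
                )
            ]
        )
    ],
)
```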

Other details on the Bucket properties can be found in our infrastructure code (available in the Cloud_IoT_DIY_cloud section of the repo in the cloud_iot_diy_cloud_stack.py stack definition).

NOTE that for enterprise systems S3 may not be suitable due to higher latency, especially for Big Data. In this case the S3 Buckets (at least some of them) can be replaced with properly partitioned DynamoDB Tables. This will be faster but much more expensive.

NoSQL storage

DynamoDB is minimally used in our system (as it can be expensive for DIY). The only Table is used to track complicated IoT session state. Data in the Table will be encrypted (with AWS managed keys) and the billing mode will be PAY_PER_REQUEST. For that Table we’ll be using a composite primary key with:

  • ThingId (the Thing name in our scenario) as the partition key
  • SessionId (a generated UUID for each session) as the sort key

The combination of these keys is globally unique and forms the primary key for any record. Each record will also have a timestamp and other attributes required for session control.
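A minimal CDK sketch of such a Table (the construct id is illustrative; see the stack definition for the real one):

```python
from aws_cdk import aws_dynamodb as dynamodb

# Sessions table: composite primary key (ThingId + SessionId),
# AWS managed encryption, on-demand (pay-per-request) billing
sessions_table = dynamodb.Table(
    self, "IotSessionsTable",
    partition_key=dynamodb.Attribute(
        name="ThingId", type=dynamodb.AttributeType.STRING
    ),
    sort_key=dynamodb.Attribute(
        name="SessionId", type=dynamodb.AttributeType.STRING
    ),
    billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
    encryption=dynamodb.TableEncryption.AWS_MANAGED,
)
```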

REST APIs

APIs organization

Two APIs are defined on API GW: one for the UI and one for devices. The API definitions are 100% coded, including all required additional resources (Route53 records, Certificates, Authorizers, etc.).

The UI API currently has the following resources (endpoints) to support data collection and communication with devices. All endpoints are protected with AuthN/Z.

  • devices — GET to collect the device list
  • devices/{device_id} — GET to collect device details
  • devices/{device_id}/telemetry — GET to collect telemetry (the latest data)
  • devices/{device_id}/historical — GET to collect historical data
  • devices/{device_id}/command — POST to send a command to the device
  • devices/{device_id}/command/{session_id} — GET to collect session info
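A rough sketch of how the first two endpoints above can be wired in CDK (ui_api, the handler functions and ui_authorizer are placeholders assumed to be defined elsewhere in the stack):

```python
from aws_cdk import aws_apigateway as apigw

# GET /devices
devices = ui_api.root.add_resource("devices")
devices.add_method(
    "GET",
    apigw.LambdaIntegration(devices_get_fn),
    authorizer=ui_authorizer,
    authorization_type=apigw.AuthorizationType.COGNITO,
)

# GET /devices/{device_id}
device = devices.add_resource("{device_id}")
device.add_method(
    "GET",
    apigw.LambdaIntegration(device_get_fn),
    authorizer=ui_authorizer,
    authorization_type=apigw.AuthorizationType.COGNITO,
)
```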

Note that for now we do not support device configuration over the REST API. Instead, MQTT from the device and the AWS Console UI can be used. This is probably unacceptable for enterprise systems, where one of the typical requirements is device configuration via a proprietary UI. However, this requirement (if you need it) can be implemented easily by extending the REST API with extra resources and methods.

The Devices MTLS API is created for device OTA updates and other situations where REST API usage is reasonable. This API relies on mutual TLS for client identification, as provisioned devices already have certificates issued by a trusted authority. However, for the MTLS API to be viable, an extra step is required during device provisioning (the MTLS API trust store needs to be updated and the API domain reinitialized). In our system this step is the responsibility of the Status command Lambda.

AuthN/Z

AutheNtication and AuthoriZation are typically among the least covered topics in any POC/DIY project. There are multiple reasons for that: it’s complicated, resource consuming, hard to demo, etc. However, without proper AuthN/Z no project can really be treated as ‘completed’.

In our system we’ll rely on AWS Cognito as the IDentity Provider. Cognito has a lot of features and can be used as a standalone OAuth IDP, a proxy for an SSO implementation, access control for AWS resources, etc. We’ll be using:

  • Cognito User Pool — for the sake of simplicity, users in our system will be created with the AWS Console. This is a serious simplification (no ‘Admin portal’) and a functionality limit, but it seems OK for DIY systems where you probably have just a few users reflecting your family and friends. For enterprise multi-tenant systems this is obviously not acceptable, and you’ll need to create a separate UI for user management.
  • Cognito App Client — for now just an app client for the web application will be created.
  • Cognito Authorizer for the API on API GW — API GW provides a simplified way of defining Authorizers when using Cognito. Instead of creating an Authorizer Lambda for token validation and the AuthN/Z decision, you can just use the predefined ‘Cognito Authorizer’. This makes the infrastructure code somewhat simpler but requires correct configuration of resource servers and scopes.
  • Cognito OAuth 2.0 authorization code grant flow with the hosted UI for end-user authentication

One of the very important benefits of Cognito for our project is that it can also be coded as part of the infrastructure! I.e., you do not need to separately deploy and configure an IDP for your project. Instead, it’ll be a service instantiated and configured as part of the project infrastructure deployment. You can find details in the infrastructure code (available in the Cloud_IoT_DIY_cloud section of the repo in the cloud_iot_diy_cloud_stack.py stack definition).
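To give an idea of what ‘Cognito as code’ looks like, here is a simplified sketch (the domain prefix, callback URL and construct ids are placeholders, not the values used in the repo):

```python
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_cognito as cognito

# User pool with a hosted UI domain; sign-up is disabled, users are created manually
user_pool = cognito.UserPool(self, "DiyUserPool", self_sign_up_enabled=False)
user_pool.add_domain(
    "HostedUi",
    cognito_domain=cognito.CognitoDomainOptions(domain_prefix="cloud-iot-diy"),
)

# App client for the web application, authorization code grant flow only
web_client = user_pool.add_client(
    "WebAppClient",
    o_auth=cognito.OAuthSettings(
        flows=cognito.OAuthFlows(authorization_code_grant=True),
        scopes=[cognito.OAuthScope.OPENID, cognito.OAuthScope.EMAIL],
        callback_urls=["https://example.com/callback"],  # placeholder URL
    ),
)

# Predefined 'Cognito Authorizer' that protects the UI API methods
ui_authorizer = apigw.CognitoUserPoolsAuthorizer(
    self, "UiAuthorizer", cognito_user_pools=[user_pool]
)
```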

For the sake of simplicity we do not use other helpful Cognito features like the user sign-up flow, auth with well-known providers (Google, Apple, etc.), and so on.

Cognito is often ignored in enterprise systems, as an own IDP is already available and the requirement is to support it. However, even in the ‘own IDP’ scenario Cognito can be helpful as a proxy which helps implement the SSO pattern.

Computational resources

Lambdas definitions

In general, defining a Lambda function with AWS CDK is quite straightforward, except for the sheer number of available parameters. You should review the parameters and their default values when defining your function.

Some important properties shared by all our Lambdas (code available in the Cloud_IoT_DIY_cloud section of the repo):

  • architecture — ARM_64. The AWS default is X86_64 as it’s more universal, but ARM_64 is about 30% less expensive, and if you’re not going to use dependencies which require X86 it’s obviously the better choice.
  • log_retention — ONE_WEEK…ONE_YEAR, but always explicitly defined. The AWS default is INFINITE, which often results in tons of garbage logs collected during development living in CloudWatch and gradually increasing the bill from pennies to hundreds of dollars. NOTE that adding log_retention results in the deployment of an extra ‘service’ Lambda function (defined and maintained by CDK) which sets the retention policy so that CloudWatch cleans up old logs for you.
  • runtime — PYTHON_3_9. We’ll be using Python as the language of choice for our Lambda functions.
  • handler — “lambda_code.lambda_handler”. All our Lambda function sources are organized in a standard way, including a dedicated ‘lambda_handler’ function defined in ‘lambda_code.py’.
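Putting these defaults together, a typical function definition looks roughly like this (the function name and asset path are illustrative):

```python
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_logs as logs

devices_get_fn = _lambda.Function(
    self, "ApiUiDevicesGet",
    function_name=f"{construct_id}-api_ui_devices_get",  # naming is discussed below
    runtime=_lambda.Runtime.PYTHON_3_9,
    architecture=_lambda.Architecture.ARM_64,
    handler="lambda_code.lambda_handler",
    code=_lambda.Code.from_asset("src/api_ui_devices_get"),
    log_retention=logs.RetentionDays.ONE_WEEK,
)
```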

Lambda function names follow a naming convention which is a simplified version of the usual one. As was mentioned before, we do not include multiple stages (like dev/qa/prod/etc.) in our project, so we do not need to include a stage identifier in the function name (if needed, this can easily be added as a stack parameter which adds a prefix/suffix to the function name).

So we’ll use these names for the folders and as a base for the resource names:

  • All Lambda functions for API endpoints are named by the template
    “api_<ui/mtls>_<resource path>_<http_method>”, for example:
    api_ui_devices_deviceid_get
  • All other functions are named by the template
    “<trigger type>_<brief description>”, for example:
    scheduled_telemetry_aggregation

NOTE that to avoid name conflicts (clashes), each Lambda Function resource name is extended with the Construct/Stack Id. So if you need to deploy multiple projects (like stages) into the same account, you can just reflect the stage in the StackId (name). The only issue with this approach is the function name length limit of 64 characters. We can either keep this limit in mind when choosing function and stack names or use a hash of the stack name to fit into the limit.
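A purely illustrative sketch of such name construction (the repo may build names differently):

```python
import hashlib

def build_function_name(stack_id: str, base_name: str, limit: int = 64) -> str:
    """Combine stack id and base name; fall back to a stack-id hash if too long."""
    name = f"{stack_id}-{base_name}"
    if len(name) <= limit:
        return name
    prefix = hashlib.sha1(stack_id.encode()).hexdigest()[:8]
    return f"{prefix}-{base_name}"[:limit]

# build_function_name("CloudIotDiyStack", "api_ui_devices_deviceid_get")
# -> "CloudIotDiyStack-api_ui_devices_deviceid_get"
```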

Lambda Parameters

Typically your Lambda requires some information about the environment it’s running in: a DynamoDB Table name and primary key info, the names of some S3 Buckets, etc.

The best practice is to never hard-code this type of parameter but instead collect them at run time. And here comes the question of where to store those parameters. There are multiple options:

  • Lambda Environment Variables are one of the most popular free ways of propagating parameters. You can add those variables at or after deployment and access them using the standard OS environment SDK. There are two issues to keep in mind when using environment variables: you should never ever store sensitive values there, and you implicitly make each function vulnerable to accidental misbehavior if its environment is changed while the environments of other functions are not. I.e., if you change the value of an environment variable (with the console, for example) for one function but not for another related one, the overall system behavior will be incorrect. This type of bug is hard to identify.
  • API Stage Variables are a less popular but extremely efficient free way of propagating common parameters to Lambdas serving API endpoints. One of the biggest benefits is a single source of configuration for all API endpoints. But, again, you should never ever store sensitive values there.
  • SSM Parameter Store is a simple and inexpensive AWS resource for propagating parameters, including sensitive ones.
  • Secrets Manager is a more feature-rich (including rotation, extra access control and audit) but more expensive way to store sensitive information.

When deciding where to store your parameters, you should review all of these options and their limits.

In our project we’ll be using Stage and Environment variables only, as we don’t have sensitive values to be stored as function parameters (access will be controlled with IAM roles and permissions).
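For illustration, a handler can pick up both kinds of parameters like this (the variable names are hypothetical; stage variables appear in the event for Lambda proxy integrations):

```python
import os

def lambda_handler(event, context):
    # Environment variable set by the infrastructure code at deployment time
    telemetry_bucket = os.environ.get("TELEMETRY_BUCKET_NAME", "")

    # API Gateway stage variables (shared by all endpoints of the stage)
    stage_vars = event.get("stageVariables") or {}
    sessions_table = stage_vars.get("sessionsTableName", "")

    # ... main logic would use telemetry_bucket / sessions_table here ...
    return {"statusCode": 200, "body": "ok"}
```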

Lambda Layers

Lambda Layers are one of those awesome AWS features which help you organize your code efficiently and make your deployment packages smaller.

In general, a Layer is just a piece of code which is handled by AWS separately and can be exposed to the Lambda runtime if required. Effectively this means that if some of your functions rely on common components, you can isolate those components into a Layer, deploy the Layer separately and reference the required components in your function code.

So Lambda Layers can be very effective, but they require some specific actions during the build and a specific approach when referencing them in code (for some languages).

UI hosting

The standard pattern is used, with:

  • UI (Single Page Application) code hosted in a private S3 bucket
  • AWS CloudFront providing TLS 1.2-protected access
  • Certificates and DNS records (Route53) created as part of the infrastructure code

The UI code is updated as part of the cloud backend deployment. This means that we can work on the UI code locally. When it’s ready, we just need to build it and run cdk deploy to update the code in the cloud (this will efficiently upload the new UI code and invalidate the CloudFront cache).
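In CDK this usually boils down to something like the following (the asset path and construct ids are placeholders; site_bucket and distribution are assumed to be defined earlier in the stack):

```python
from aws_cdk import aws_s3_deployment as s3deploy

s3deploy.BucketDeployment(
    self, "DeployUiContent",
    sources=[s3deploy.Source.asset("../Cloud_IoT_DIY_ui/build")],  # built SPA
    destination_bucket=site_bucket,
    # Invalidate the CloudFront cache so the new UI is served immediately
    distribution=distribution,
    distribution_paths=["/*"],
)
```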

Details on the UI (including the authentication process) will be covered in later sections.

Other

AI components will be discussed in detail in later sections.

The notification service (SES/SNS) is out of the picture, as it’s optional and not included in the design for now. However, Cognito’s native functionality is available for user verification. It’s limited but more than enough for DIY.

AWS CDK infrastructure code

The CDK code available in the GitHub repository makes deployment of this complicated (and maybe even scary) cloud infrastructure very simple.

Almost complete cloud design diagram (provisioning, dashboards and alarms are not covered)

The repository has a README.md file with usage instructions, but in general you can deploy the whole project (including the web-site content) with just one command (cdk deploy) as soon as it’s properly configured.

Backend Code

Lambdas code

One of the important principles when designing a serverless backend is separation of concerns: each Lambda function should “do one thing but do it well”.

So in general it’s a good idea to have a Lambda function per resource/method combination, unless the Lambdas are almost identical (like a query on one parameter and a query on multiple parameters).
In our project each resource/method combination is implemented with a dedicated function. This may look like overkill, but in reality this approach (true microservices) helps keep the Lambda code simple and manageable.

Python is the language of choice for our functions. When choosing the language it’s a good idea to take the typical cold start time into consideration. The cold start issue has multiple solutions and can be easily resolved with a minor cost increase, but if you can choose the language it may be a good idea to not have that issue at all. In general it’s better to use Python, NodeJS or TypeScript.

So all our Lambda functions are coded in Python and located in subfolders within the main cloud section Cloud_IoT_DIY_cloud/src. All Lambdas can be built and prepared for deployment with the helper script lbuild.py. This tool can also deploy the infrastructure for you (and is used by the bootstrapping tool described in the section ‘Helper tools and project config’). Most of the Lambdas serve API endpoints (REST resources).

The standard endpoint code structure follows another best practice: keep the AWS-specific part (the handler) separate from the main logic. With this approach the main logic can easily be tested locally and is easily portable if we ever want to migrate to Azure or GCP.

Another important approach is abstraction of the AWS resources in use. For example, instead of using boto3 directly to access S3, we have an abstraction layer (ObjectsDatasource) which is used across Lambdas. So if we ever need to migrate to another storage or even another cloud, that abstraction layer will be the only part to update. Each abstraction layer has a common interface defined and a factory method to create the cloud-specific implementation.
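The shape of such an abstraction might look roughly like this (a simplified sketch; the class and method names in the repo may differ):

```python
from abc import ABC, abstractmethod

import boto3


class ObjectsDatasource(ABC):
    """Common interface for object storage, independent of the cloud provider."""

    @abstractmethod
    def put_object(self, key: str, body: bytes) -> None: ...

    @abstractmethod
    def get_object(self, key: str) -> bytes: ...

    @staticmethod
    def create(provider: str, location: str) -> "ObjectsDatasource":
        """Factory method returning a cloud-specific implementation."""
        if provider == "aws":
            return S3ObjectsDatasource(bucket_name=location)
        raise ValueError(f"Unsupported provider: {provider}")


class S3ObjectsDatasource(ObjectsDatasource):
    """AWS implementation on top of boto3/S3."""

    def __init__(self, bucket_name: str):
        self._bucket = bucket_name
        self._s3 = boto3.client("s3")

    def put_object(self, key: str, body: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=body)

    def get_object(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()
```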

Abstraction layer as Lambda Layer

The term ‘layer’ can be confusing, as it’s also used for the AWS resource. To avoid any confusion, the Lambda Layer (cloud resource) is always capitalized, while the abstraction layer (software design pattern) is always lower-case.

As already mentioned, Lambdas can be built with lbuild.py. In general, this tool collects the requirements for each Lambda and packs them with the Lambda sources. The important point here is that each Lambda folder can have its own requirements.txt file with that particular Lambda’s dependencies. So you can use the top-level .lambda_venv and requirements.txt to work locally and install all requirements for all Lambdas in one place, but when built for deployment your Lambda will use the minimal required list (if provided) to keep the deployment package small. This is really important, as boto3, for example, should be installed locally but not included in the deployment package, since it’s available by default in any Python Lambda runtime.

Lambda Layers code, build and usage

Lambda Layers are located in the same Cloud_IoT_DIY_cloud/src but in dedicated subfolders with a ‘_’ prefix (for example Cloud_IoT_DIY_cloud/src/_nosql_datasource). Lambda Layers can also be built with lbuild.py, but the process is slightly different, as the Layer deployment package must have a particular subfolder structure.

With Python, referencing Layers is quite simple if you follow the folder structure described above. You can just import from _<layer-name> and it’ll work both locally and in the cloud.
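For example, assuming the Layer is packaged so that the module lands on the Python path (for Python Layers AWS expects the code under a python/ folder inside the Layer package), the import looks the same locally and in the cloud:

```python
# Resolved from the local folder during development and from the Lambda Layer in the cloud
from _objects_datasource import ObjectsDatasource

datasource = ObjectsDatasource.create("aws", "my-telemetry-bucket")  # illustrative values
```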

Note that for other languages Layer referencing may be a little more tricky. For example, if using TS, you may need to use the paths option in your tsconfig to effectively manage Layer dependencies.

In our project we’ll put our resource abstractions (abstraction layers) into Lambda Layers and have:

  • NoSqlDatasource in _nosql_datasource
  • ObjectsDatasource in _objects_datasource

Cloud design review

Any cloud system should be reviewed against some basic design principles. A long time ago AWS published the very good concept of the “Five Pillars of the Well-Architected Framework”, which was extended with a sixth pillar a while ago (see the AWS doc). I’ve collected the pillars in the picture and use it as a reference whenever a cloud system review is performed.

A full cloud design review for our project would take too much time and space, so we’ll just briefly go over some basic design review points.

Efficiency, reliability and security of our cloud backend

Brief security overview:

  • all our data is encrypted at rest and in transit (TLS 1.2 is enforced for data transfers, and encryption is enabled for all storage).
  • all REST API endpoints are protected with either JWT-based request authorizers or MTLS
  • MQTT communication is protected with TLS 1.2 and MTLS
  • the main service-to-service access control method is IAM permissions
  • sensitive values are stored with dedicated AWS services (certificates with Certificate Manager or IoT certificates, any other sensitive value with an SSM Parameter Store SecureString)
  • for better security a WAF could be added in front of our APIs; however, this would also add considerable cost. A WAF could be a requirement for enterprise solutions but can probably be omitted for DIY.

Brief efficiency overview:

  • Our cloud backend is 100% serverless.
  • Our infrastructure code is an opportunity for continuous experiments and improvements.

Brief reliability overview:

  • Our serverless resources use the ‘on demand’ pattern (we’re not guessing the load, as that is almost impossible for DIY applications)
  • Our whole infrastructure is coded, so in case of failure it can easily be redeployed
  • While it’s overkill for DIY, coded parametrized infrastructure opens the possibility of multi-regional deployments (if regional reliability is ever needed).
  • Our data storage could easily be backed up with AWS services; however, this is not part of the project, to decrease cost.

Cost of our cloud backend

This is probably the most important topic for DIY projects. You definitely don’t want to end up with unexpectedly high cloud bills.

I’m not publishing the whole cost model for our backend, but the projection looks like this:

What is important to notice:

  • We are using the least expensive storage (S3), which adds some complexity to the API and may not be suitable for enterprise solutions.
  • Our backend is 100% serverless, which makes it extremely cost-effective (especially for DIY, where the load is absolutely unpredictable). We’re also using ARM computational resources to lower the cost even more.
  • We’re also using AWS IoT Core Basic Ingest to decrease the cost of IoT Core messaging.
  • With such inexpensive systems, some unexpected components start to be noticeable in the bill. Specifically, CloudWatch log storage can become a considerable part of it. We seriously restrict the log retention period to decrease the cost. This may not be suitable for enterprises at all but seems OK for DIY.
  • The cost can be even lower if the AWS Free Tier is used (for the first year at least).

Operational challenges of our cloud backend

With 100% coded infrastructure, most of the operational challenges will be around failure monitoring and handling. The obvious improvement beyond that would be automated CI/CD with continuous deployment. This can be done with GitHub Actions or AWS CodePipeline, for example.

Failure monitoring and handling is a large topic with potential for improvements. All of our services use CloudWatch for logging, so automation can be added.
