AWS native Monitoring

Short writeup about native and common AWS monitoring solutions: CloudWatch, X-Ray, and CloudTrail

AWS CloudWatch

AWS CloudWatch Metrics

  • CloudWatch provides metrics for almost all the services in AWS
  • “Metric” is a variable to monitor (CPUUtilization, NetworkIn, …)
  • Metrics belong to “namespaces”
  • “Dimension” is an attribute of a metric (instance id, environment, etc…)
  • Up to 30 dimensions per metric
  • Metrics have “timestamps”
  • Can create CloudWatch dashboards of metrics

AWS CloudWatch EC2 Detailed monitoring

  • EC2 instance metrics are reported “every 5 mins” by default
  • With detailed monitoring (for a cost), you get data “every 1 min”
  • Use detailed monitoring if you want your ASG to scale more promptly

  • Note: EC2 Memory usage is by default not pushed (must be pushed from inside the instance as a custom metric)

AWS CloudWatch Custom Metrics

  • Possibility to define and send your own custom metrics to CloudWatch
  • Ability to use dimensions (attributes) to segment metrics
    • Instance.id
    • Environment.name
  • Metric resolution
    • Standard: 1 minute
    • High Resolution: up to 1 second (StorageResolution API parameter), which leads to higher cost
  • Use API call “PutMetricData”
  • Use exponential backoff when the API returns throttling errors
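
The points above can be sketched roughly as follows. This is a minimal illustration, not a definitive client: the dict mirrors the shape of a PutMetricData entry, while the dimension names, the `ThrottledError` class, and both helper functions are hypothetical names introduced here.

```python
import random
import time
from datetime import datetime, timezone


def build_metric_datum(name, value, instance_id, environment, high_resolution=False):
    """Build one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        # Dimensions segment the metric (names here are illustrative).
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "Environment", "Value": environment},
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": value,
        "Unit": "Percent",
        # 1 = high resolution (1-second granularity), 60 = standard resolution.
        "StorageResolution": 1 if high_resolution else 60,
    }


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by the client."""


def call_with_backoff(send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send(); on throttling, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

With an SDK client, `send` would wrap the real PutMetricData call (passing a namespace and a list of such entries), and the `except` clause would catch the SDK's own throttling error instead of `ThrottledError`.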

AWS CloudWatch Alarms

  • Alarms are used to trigger notifications for any metric
  • Alarms can go to Auto Scaling, EC2 Actions, SNS notifications
  • Various options (sampling, %, max, min, etc…)
  • Alarm states: OK, INSUFFICIENT_DATA, ALARM
  • Period:
    • Length of time in seconds to evaluate the metric
    • High-resolution custom metrics: you can also choose 10 sec or 30 sec; otherwise use a multiple of 60 sec
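
As an illustration of the alarm settings above, here is a small sketch that builds arguments in the shape of the PutMetricAlarm API and checks the period. The alarm name, threshold, and both function names are assumptions made for this example.

```python
def is_valid_alarm_period(period_seconds, high_resolution_metric=False):
    """True if the period is 10 or 30 (high-resolution metrics only) or a multiple of 60."""
    if period_seconds in (10, 30):
        return high_resolution_metric
    return period_seconds > 0 and period_seconds % 60 == 0


def build_cpu_alarm(instance_id, topic_arn, period_seconds=300):
    """Arguments for an alarm that notifies an SNS topic when average CPU stays above 80%."""
    if not is_valid_alarm_period(period_seconds):
        raise ValueError("period must be a multiple of 60 for standard metrics")
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,       # seconds per evaluated datapoint
        "EvaluationPeriods": 2,         # consecutive periods that must breach
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],    # where the ALARM state is sent
    }
```

The returned dict could be passed as keyword arguments to an SDK call; Auto Scaling or EC2 actions would be supplied as different ARNs in `AlarmActions`.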

AWS CloudWatch Logs

  • Applications can send logs to CloudWatch using the SDK
  • CloudWatch can collect logs from:
    • Elastic Beanstalk: collection of logs from application
    • ECS: collection from containers
    • AWS Lambda: collection from function logs
    • VPC Flow Logs: VPC specific logs
    • API Gateway
    • CloudTrail based on filter
    • CloudWatch log agents: for example on EC2 machines
    • Route53: Log DNS queries
  • CloudWatch logs can go to:
    • Batch exporter to S3 for archival
    • Stream to ElasticSearch cluster for further analytics

CloudWatch Logs for EC2

  • By default, no logs from your EC2 machine will go to CloudWatch
  • You need to run a CloudWatch agent on EC2 to push the log files you want

CloudWatch Logs Agent & Unified Agent

  • Both are for virtual servers (EC2 instances, on-premises servers)
  • CloudWatch Logs Agent
    • Old version of the agent
    • Can only send to CloudWatch Logs
  • CloudWatch Unified Agent
    • Collect additional system-level metrics such as RAM, processes, etc
    • Collect logs to send to CloudWatch Logs
    • Centralized configuration using SSM Parameter Store

CloudWatch Logs Metric Filter

  • CloudWatch Logs can use filter expressions
    • For example, find a specific IP inside of a log
    • Or count occurrences of “ERROR” in your logs
    • Metric filters can then be used to trigger alarms
  • Filters do not retroactively filter data. Filters only publish the metric data points for events that happen after the filter was created.
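
A rough sketch of the idea, under stated assumptions: the first function builds arguments in the shape of the CloudWatch Logs PutMetricFilter API (the filter name and namespace are made up), and the second is a purely local simulation of the non-retroactive behavior, not how the service is implemented.

```python
def build_metric_filter(log_group, term):
    """Arguments for a filter that counts log events containing the given term."""
    return {
        "logGroupName": log_group,
        "filterName": f"count-{term.lower()}",  # hypothetical name
        "filterPattern": term,                  # a bare term matches events containing it
        "metricTransformations": [
            {
                "metricName": f"{term.title()}Count",
                "metricNamespace": "MyApp/Logs",  # hypothetical namespace
                "metricValue": "1",               # publish 1 per matching event
            }
        ],
    }


def published_datapoints(events, term, filter_created_at):
    """Local simulation: one datapoint per matching event that arrived after the filter existed."""
    return [
        (timestamp, 1)
        for timestamp, message in events
        if timestamp >= filter_created_at and term in message
    ]
```

In the simulation, an "ERROR" event logged before `filter_created_at` produces no datapoint, which is why an alarm on the resulting metric only reflects errors seen after the filter was created.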

AWS CloudWatch Events

  • Schedule: Cron jobs
  • Event Pattern: Event rules to react to a service doing something
    • Example: CodePipeline state changes!
  • Triggers to Lambda functions, SQS/SNS/Kinesis Messages
  • CloudWatch Event creates a small JSON document to give information about the change
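
To make the event pattern idea concrete, here is a small sketch for the CodePipeline example. The pattern and event follow the documented JSON shape; the pipeline name is invented, and `matches` is a simplified local stand-in that only handles exact-value matching, not the full rule syntax.

```python
# Rule pattern: react only when a pipeline execution fails.
PATTERN = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {"state": ["FAILED"]},
}

# Abbreviated example of the JSON document delivered to the target.
EVENT = {
    "version": "0",
    "source": "aws.codepipeline",
    "detail-type": "CodePipeline Pipeline Execution State Change",
    "region": "us-east-1",
    "detail": {"pipeline": "my-pipeline", "state": "FAILED"},  # hypothetical pipeline
}


def matches(pattern, event):
    """Every key in the pattern must exist in the event with one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Fields absent from the pattern (such as `region` here) are ignored, so a pattern only needs to list what it cares about.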

Amazon EventBridge

  • EventBridge is the next evolution of CloudWatch Events
  • Default event bus: generated by AWS services (CloudWatch Events)
  • Partner event bus: receive events from SaaS service or applications (Zendesk, DataDog, Segment, Auth0, …)
  • Custom event buses: for your own applications
  • Event buses can be accessed by other AWS accounts

  • Rules: how to process the events (similar to CloudWatch Events)
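
For a custom event bus, an application publishes its own events. The sketch below builds one entry in the shape of the PutEvents API; the bus name, source, and detail fields are hypothetical.

```python
import json


def build_event_entry(bus_name, source, detail_type, detail):
    """One PutEvents entry; Detail must be a JSON string, not a dict."""
    return {
        "EventBusName": bus_name,
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),
    }


entry = build_event_entry(
    bus_name="orders-bus",            # hypothetical custom bus
    source="com.example.orders",      # hypothetical application source
    detail_type="OrderPlaced",
    detail={"orderId": "42", "total": 19.99},
)
```

A rule on that bus could then match on `source` and `detail-type` in the same way rules match AWS service events.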

Amazon EventBridge Schema Registry

  • EventBridge can analyze events in your bus and infer the schema
  • The Schema Registry allows you to generate code for your application that will know in advance how data is structured in the event bus
  • Schema can be versioned

Amazon EventBridge vs CloudWatch Events

  • Amazon EventBridge builds upon and extends CloudWatch Events
  • It uses the same service API and endpoint, and the same underlying service infrastructure
  • EventBridge allows extension to add event buses for your custom applications and your third-party SaaS apps
  • EventBridge has the Schema Registry capability

  • EventBridge has a different name to mark the new capabilities
  • Over time, the CloudWatch Events name will be replaced with EventBridge

AWS X-Ray

  • Debugging in Production, the good old way:
    • Test locally
    • Add log statements everywhere
    • Re-deploy in production
  • Log formats differ across applications using CloudWatch and analytics is hard
  • Debugging: monolith “easy”, distributed services “hard”
  • No common views of your entire architecture!

AWS X-Ray advantages

  • Troubleshooting performance (bottlenecks)
  • Understand dependencies in a microservice architecture
  • Pinpoint service issues
  • Review request behavior
  • Find errors and exceptions
  • Are we meeting time SLA?
  • Where am I throttled?
  • Identify users that are impacted

AWS X-Ray leverages “Tracing”

  • Tracing is an end-to-end solution to follow a “request” across multiple hops > SPANS
  • Each component dealing with the request adds its own “trace”
  • Tracing is made of segments (+ sub segments)
  • Annotations can be added to traces to provide extra information
  • Ability to trace:
    • Every request
    • A sample of requests (as a % for example, or a rate per minute)
  • X-Ray Security
    • IAM for authorization
    • KMS for encryption at rest

How to enable AWS X-Ray?

  • Your code must import the AWS X-Ray SDK
    • Very little modification needed
    • The application SDK will then capture:
      • Calls to AWS services
      • HTTP / HTTPS requests
      • Database calls (MySQL, PostgreSQL, DynamoDB)
      • Queue calls (SQS)
  • Install the X-Ray daemon or enable X-Ray AWS Integration
    • X-Ray daemon works as a low-level UDP packet interceptor (Linux, Windows, Mac)
    • AWS Lambda / other AWS services already run the X-Ray daemon for you
    • Each application must have the IAM rights to write data to X-Ray
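
Normally the SDK talks to the daemon for you. To show what travels over UDP, here is a hedged sketch that assumes the daemon's default address of 127.0.0.1:2000 and its JSON header line; the segment values are sample data and the function names are invented for illustration.

```python
import json
import socket

# Header line the daemon expects before each segment document.
DAEMON_HEADER = b'{"format": "json", "version": 1}\n'

# Minimal segment document with sample IDs and timestamps.
segment = {
    "name": "checkout-service",  # hypothetical service name
    "id": "70de5b6f19ff9a0a",
    "trace_id": "1-581cf771-a006649127e371903a2de979",
    "start_time": 1478293361.271,
    "end_time": 1478293361.449,
}


def encode_for_daemon(doc):
    """Serialize a segment as header line + JSON body."""
    return DAEMON_HEADER + json.dumps(doc).encode("utf-8")


def send_to_daemon(doc, address=("127.0.0.1", 2000)):
    """Fire-and-forget UDP send; the daemon batches segments and uploads them to X-Ray."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_for_daemon(doc), address)
```

Because the transport is UDP, the application does not block or fail when the daemon is down, which is also why missing traces on EC2 often mean the daemon is not running.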

AWS X-Ray Troubleshooting

  • If X-Ray is not working on EC2
    • Ensure the EC2 IAM Role has the proper permissions
    • Ensure the EC2 instance is running the X-Ray Daemon
  • To enable on AWS Lambda:
    • Ensure it has an IAM execution role with the proper policy (AWSXrayWriteOnlyAccess)
    • Ensure that X-Ray is imported in the code

X-Ray Instrumentation in your code

  • Instrumentation means measuring a product’s performance, diagnosing errors, and writing trace information

X-Ray Concepts

  • Segments: Each application / service will send them
  • Sub-segments: If you need more details in your segment
  • Trace: segments collected together to form an end-to-end trace
  • Sampling: decrease the amount of requests sent to X-Ray, reduce cost
  • Annotations: Key-value pairs used to index traces and use with filters
  • Metadata: Key-value pairs, not indexed, not used for searching
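
These concepts map onto the JSON segment document roughly as sketched below. The field names follow the segment document format; the helper functions, service names, and values are hypothetical.

```python
import secrets
import time


def new_trace_id(now=None):
    """Trace ID: version 1, epoch seconds as 8 hex digits, then 24 random hex digits."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{secrets.token_hex(12)}"


def new_segment(name, trace_id, start, end):
    """A segment: the work done by one application / service for one request."""
    return {
        "name": name,
        "id": secrets.token_hex(8),  # 16 hex digits
        "trace_id": trace_id,        # shared by every segment of the same trace
        "start_time": start,
        "end_time": end,
        "annotations": {},           # indexed: usable in filter expressions
        "metadata": {},              # not indexed: extra context only
        "subsegments": [],
    }


def add_subsegment(segment, name, start, end):
    """A subsegment adds detail inside a segment, e.g. one downstream call."""
    sub = {"name": name, "id": secrets.token_hex(8), "start_time": start, "end_time": end}
    segment["subsegments"].append(sub)
    return sub


trace_id = new_trace_id(now=1478293361)
seg = new_segment("checkout-service", trace_id, 1478293361.0, 1478293361.5)
seg["annotations"]["customer_tier"] = "gold"       # searchable later
seg["metadata"]["debug"] = {"cart_items": 3}       # stored, but not searchable
add_subsegment(seg, "DynamoDB", 1478293361.1, 1478293361.2)
```

Keeping searchable keys in `annotations` and bulky context in `metadata` follows the indexed versus non-indexed split described above.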

  • The X-Ray daemon / agent has a config to send traces cross account
    • Make sure the IAM permissions are correct: the agent will assume the role
    • This allows you to have a central account for all your application tracing

X-Ray Sampling Rules

  • With sampling rules, you control the amount of data that you record
  • You can modify sampling rules without changing your code

  • By default, the X-Ray SDK records the first request “each second”, and “five percent” of any additional requests

  • One request per second is the “reservoir”, which ensures that at least one trace is recorded each second as long as the service is serving requests
  • Five percent is the “rate”, at which additional requests beyond the reservoir size are sampled
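
The reservoir and rate can be pictured with a small local simulation. This is only a sketch of the decision logic described above, with an invented `Sampler` class; it is not the SDK's actual implementation.

```python
import random


class Sampler:
    """Sample the first `reservoir` requests each second, then `rate` of the rest."""

    def __init__(self, reservoir=1, rate=0.05, rng=random.random):
        self.reservoir = reservoir
        self.rate = rate
        self.rng = rng          # injectable for deterministic tests
        self._second = None
        self._taken = 0

    def should_sample(self, now):
        second = int(now)
        if second != self._second:
            # A new second started: refill the reservoir.
            self._second = second
            self._taken = 0
        if self._taken < self.reservoir:
            self._taken += 1
            return True
        # Beyond the reservoir, sample at the fixed rate.
        return self.rng() < self.rate
```

At 100 requests per second, the defaults would record about 1 + 0.05 × 99 ≈ 6 traces per second on average, while a service receiving one request per second would have every request traced.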

AWS CloudTrail

  • Provides governance, compliance and audit for your AWS account
  • CloudTrail is enabled by default
  • Get a history of events / API calls made within your AWS account by:
    • Console
    • SDK
    • CLI
    • AWS Services
  • Can put logs from CloudTrail into CloudWatch Logs
  • If a resource is deleted in AWS, look into CloudTrail first.
Written on October 16, 2022

