Last month, we didn’t just have a typical Monday. We experienced a real-world stress test.
It started with a familiar sound: the alarm. Our front page was down, and the disruption was spreading to other services in our Kubernetes cluster. We were under a sustained Distributed Denial of Service (DDoS) attack. While these types of attacks are part of daily digital operations, this one was unusually large.
The scale of the attack: What the numbers show
This wasn't just a normal traffic spike:
- Millions of requests: A massive volume of malicious traffic hammered our perimeter in under an hour.
- 10x baseline surge: The request rate represented an order of magnitude increase over our normal operating baseline.
- 1,024 (the hard limit): Our AWS Classic Load Balancer's surge queue hit its absolute ceiling, meaning the "waiting room" for our site was completely full.
- 100% CPU saturation: CPU usage for our Ingress namespace exploded, pinning our pods at their limits as they fought to process the influx.
What followed became a strong example of rapid response and cross-team collaboration.
Service availability: When monitoring turned critical
At the peak of the attack, our Amazon Route 53 health checks dropped to zero. This meant the backend systems were too overwhelmed to respond even to basic availability checks — resulting in temporary site-wide unavailability.
Image 1. Route 53 health checks
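If you want to see that state outside the console, the per-checker view behind this graph can be pulled straight from the Route 53 API. A minimal boto3 sketch; the health check ID is a placeholder, not our real one:

```python
import boto3

route53 = boto3.client("route53")

# Placeholder ID; in practice this is the health check attached to the front-page endpoint.
HEALTH_CHECK_ID = "00000000-0000-0000-0000-000000000000"

status = route53.get_health_check_status(HealthCheckId=HEALTH_CHECK_ID)
for observation in status["HealthCheckObservations"]:
    # Each checker region reports whether the endpoint answered its last probe.
    print(observation["Region"], observation["StatusReport"]["Status"])
```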
The wall of traffic: Request volume surge
The total request count hitting our infrastructure increased dramatically within a very short period. This sudden surge triggered our alerts and quickly exhausted available computing resources.
Image 2. Requests on Elastic Load Balancing (ELB)
The challenge: Shared entry points
Our infrastructure runs on Amazon Elastic Kubernetes Service (EKS), using an Nginx Ingress Controller to manage incoming traffic.
At the time, a single AWS Classic Load Balancer handled traffic to our front page.
When the attack targeted our content management system (CMS), the traffic surge pushed our Ingress components to full processing capacity. Because other services — such as the customer portal — shared the same load balancer, the issue was not isolated to the CMS. The availability of the entire cluster was at risk.
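To make the shared entry point concrete: with a single ingress class, every host rule ends up behind the same controller and therefore the same Classic Load Balancer. A simplified sketch using the Kubernetes Python client; the host names, namespaces, and service names are placeholders, not our actual configuration:

```python
from kubernetes import client, config

config.load_kube_config()
networking = client.NetworkingV1Api()

def ingress_for(namespace: str, host: str, service: str) -> client.V1Ingress:
    """Build a minimal Ingress rule handled by the shared ingress-nginx controller."""
    return client.V1Ingress(
        metadata=client.V1ObjectMeta(name=f"{service}-ingress", namespace=namespace),
        spec=client.V1IngressSpec(
            ingress_class_name="nginx",  # same class -> same controller -> same Classic ELB
            rules=[client.V1IngressRule(
                host=host,
                http=client.V1HTTPIngressRuleValue(paths=[client.V1HTTPIngressPath(
                    path="/",
                    path_type="Prefix",
                    backend=client.V1IngressBackend(
                        service=client.V1IngressServiceBackend(
                            name=service,
                            port=client.V1ServiceBackendPort(number=80),
                        )
                    ),
                )]),
            )],
        ),
    )

# Both the CMS and the customer portal resolve to the same entry point,
# which is why saturating one endangered the other.
for ns, host, svc in [("cms", "www.example.com", "cms-web"),
                      ("portal", "portal.example.com", "portal-web")]:
    networking.create_namespaced_ingress(namespace=ns, body=ingress_for(ns, host, svc))
```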
Our initial response followed standard procedure: we manually increased replica counts and relied on automatic scaling. However, the attack volume was too high for scaling alone to absorb.
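The manual part of that response is essentially a scale patch against the affected Deployments. A hedged sketch with the Kubernetes Python client; the workload names and replica counts are illustrative:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

def scale(namespace: str, deployment: str, replicas: int) -> None:
    """Bump the replica count, the same effect as `kubectl scale deployment`."""
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

# Placeholder workload names for the ingress controller and the CMS.
scale("ingress-nginx", "ingress-nginx-controller", 10)
scale("cms", "cms-web", 12)
```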
Importantly, our core products — Connect and Integrate, as well as Build and Operate — were not affected. This was possible because we follow strict Telekom security standards that separate front-end systems, backend systems, and customer environments at the network level.
System saturation: Processing capacity reached its limit
CPU metrics clearly showed 100% utilization in both the Nginx Ingress Controller namespace and the CMS namespace. Despite autoscaling, the pods were completely saturated as they tried to process the malicious traffic, which mimicked human-like request patterns.
Image 3. CPU usage in the Nginx Ingress Controller namespace
Image 4. CPU usage in the CMS namespace
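If you track these namespaces with Prometheus, an equivalent view of the charts above is a per-namespace CPU rate query. A small sketch; the Prometheus URL and the namespace labels ("ingress-nginx", "cms") are assumptions, not our real names:

```python
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"
QUERY = 'sum by (namespace) (rate(container_cpu_usage_seconds_total{namespace=~"ingress-nginx|cms"}[5m]))'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    namespace = series["metric"]["namespace"]
    cpu_cores = float(series["value"][1])  # CPU seconds per second, i.e. cores in use
    print(f"{namespace}: {cpu_cores:.2f} cores")
```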
User impact: Increased response times
During the attack, average response times increased significantly. As systems struggled to handle the traffic load, users experienced noticeable delays and temporary service interruptions.
This metric clearly illustrates the real-world impact before mitigation measures were fully implemented.
Image 5. Average latency during the peak of the attack
The waiting room filled: Load balancer capacity reached
The load balancer’s surge queue reached its maximum capacity of 1,024 requests.
This queue acts as a temporary holding area for incoming traffic. Once full, additional requests could not be forwarded to the backend systems — confirming that the infrastructure had reached full saturation.
Image 6. Classic ELB’s SurgeQueueLength metric
Automatic protection: Controlled rejections prevented collapse
Once the queue limit was reached, additional requests were automatically rejected with “503 Service Unavailable” responses.
Although this meant some users saw error messages, this mechanism served as a critical safety valve. It prevented the overload from spreading further and protected the Kubernetes cluster from a total system failure.
Image 7. Each spike in the "Spillover Count" represents requests that were automatically rejected
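Both of these signals are standard CloudWatch metrics for Classic Load Balancers, so they are easy to alert on before an incident. A boto3 sketch that pulls the last hour of SurgeQueueLength and SpilloverCount; the load balancer name and region are placeholders:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")  # region is an assumption
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

def elb_metric(name, stat):
    """Fetch one Classic ELB metric at 1-minute resolution."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=name,
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-classic-elb"}],  # placeholder
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=[stat],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

# SurgeQueueLength maxes out at 1,024; SpilloverCount counts the 503 rejections.
# The same helper also covers RequestCount (Sum) and Latency (Average).
for point in elb_metric("SurgeQueueLength", "Maximum"):
    print(point["Timestamp"], point["Maximum"])
for point in elb_metric("SpilloverCount", "Sum"):
    print(point["Timestamp"], point["Sum"])
```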
The strategic shift: Moving protection closer to the edge
With support from the Telekom Cyber Emergency Response Team and AWS Enterprise Support, we reassessed our architecture under pressure.
AWS identified the Classic Load Balancer as a bottleneck and recommended migrating to an Application Load Balancer combined with a Web Application Firewall (WAF).
However, performing a full migration during an active incident was not feasible. We needed immediate protection.
The solution: Amazon CloudFront
We deployed a new CloudFront distribution and pointed our domain to it. Because the CMS serves dynamic content, the configuration had to be precise (a sketch of the setup follows the list below):
- Cache Policy: Managed-CachingDisabled
- HTTP Methods: Full suite allowed (GET, POST, PUT, DELETE, etc.) to maintain full CMS functionality
- Security: AWS WAF enabled directly on CloudFront with Core Protections, rate limiting, and SQL injection defense
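For illustration, a distribution along these lines can be sketched with boto3 roughly as follows. All names, the origin domain, the verification header, and the web ACL ARN are placeholders, and the real setup involves more detail (certificates, aliases, logging) than shown here:

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")

# Look up the AWS-managed "CachingDisabled" cache policy instead of hardcoding its ID.
managed = cloudfront.list_cache_policies(Type="managed")["CachePolicyList"]["Items"]
caching_disabled_id = next(
    p["CachePolicy"]["Id"]
    for p in managed
    if p["CachePolicy"]["CachePolicyConfig"]["Name"] == "Managed-CachingDisabled"
)

distribution_config = {
    "CallerReference": str(time.time()),
    "Comment": "CMS distribution fronting the load balancer",
    "Enabled": True,
    "Origins": {
        "Quantity": 1,
        "Items": [{
            "Id": "cms-origin",
            "DomainName": "cms-lb.example.com",  # placeholder for the load balancer's DNS name
            "CustomOriginConfig": {
                "HTTPPort": 80,
                "HTTPSPort": 443,
                "OriginProtocolPolicy": "https-only",
            },
            # Shared secret used later so the origin can reject traffic that bypasses CloudFront.
            "CustomHeaders": {
                "Quantity": 1,
                "Items": [{"HeaderName": "X-Origin-Verify", "HeaderValue": "<random-secret>"}],
            },
        }],
    },
    "DefaultCacheBehavior": {
        "TargetOriginId": "cms-origin",
        "ViewerProtocolPolicy": "redirect-to-https",
        "CachePolicyId": caching_disabled_id,
        # Allowing the full method set keeps CMS editing (POST/PUT/DELETE) working.
        "AllowedMethods": {
            "Quantity": 7,
            "Items": ["GET", "HEAD", "OPTIONS", "PUT", "POST", "PATCH", "DELETE"],
            "CachedMethods": {"Quantity": 2, "Items": ["GET", "HEAD"]},
        },
    },
    # Attach the CloudFront-scoped WAF web ACL by its ARN (placeholder value).
    "WebACLId": "arn:aws:wafv2:us-east-1:123456789012:global/webacl/cms-waf/EXAMPLE",
}

cloudfront.create_distribution(DistributionConfig=distribution_config)
```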
Closing the back door

Initially, the attack did not stop. We observed that traffic was still hitting our load balancer directly: the attacker was bypassing CloudFront and targeting our public endpoints.
That was our “aha” moment.
To resolve this, we implemented a handshake mechanism between CloudFront and our Ingress controller. Only traffic coming through CloudFront was allowed.
The back door was locked. Any request attempting to bypass the WAF was immediately rejected by Nginx. Within minutes, traffic patterns flattened back to baseline. The attacker gave up.
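We describe the outcome here rather than the exact mechanism, so treat the following as a conceptual sketch only. A common way to implement this kind of handshake is a secret header that CloudFront injects at the origin and the ingress layer verifies; in our case the enforcement lives in the Nginx Ingress layer, not in application code. The Python below, with an assumed header name, just illustrates the check itself:

```python
import hmac
import os

# Illustrative header name and secret; CloudFront injects the header at the origin,
# and anything arriving without it is rejected before reaching the application.
ORIGIN_VERIFY_HEADER = "X-Origin-Verify"
ORIGIN_VERIFY_SECRET = os.environ["ORIGIN_VERIFY_SECRET"]

def came_through_cloudfront(headers: dict) -> bool:
    """Return True only if the request carries the shared secret set by CloudFront."""
    candidate = headers.get(ORIGIN_VERIFY_HEADER, "")
    return hmac.compare_digest(candidate, ORIGIN_VERIFY_SECRET)

def handle(headers: dict) -> int:
    # Requests that bypass CloudFront (and therefore the WAF) get an immediate rejection.
    if not came_through_cloudfront(headers):
        return 403
    return 200
```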
The resilient test: A second attack
A few weeks after the initial incident, our systems were targeted again by a similar large-scale DDoS attack.
This was the real test. Would our new architecture withstand the same pressure that had previously disrupted our CMS?
By this time, AWS CloudFront and AWS WAF were fully operational, and the infrastructure handled the surge exactly as designed. Malicious traffic was identified and blocked at the network edge, long before it could reach our Kubernetes cluster.
Image 8. AWS WAF activity during the follow-up attack
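For reference, a web ACL with the protections described earlier (the core managed rules, SQL injection rules, and IP-based rate limiting) looks roughly like this when created with boto3. The ACL name, metric names, and rate limit are illustrative values, not our production settings:

```python
import boto3

# CloudFront-scoped web ACLs must be created in us-east-1.
wafv2 = boto3.client("wafv2", region_name="us-east-1")

def managed_rule(name, group, priority):
    """Reference an AWS-managed rule group and pass its verdicts through."""
    return {
        "Name": name,
        "Priority": priority,
        "Statement": {"ManagedRuleGroupStatement": {"VendorName": "AWS", "Name": group}},
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }

wafv2.create_web_acl(
    Name="cms-waf",  # illustrative name
    Scope="CLOUDFRONT",
    DefaultAction={"Allow": {}},
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "cms-waf",
    },
    Rules=[
        managed_rule("core-protections", "AWSManagedRulesCommonRuleSet", 0),
        managed_rule("sqli-protection", "AWSManagedRulesSQLiRuleSet", 1),
        {
            # Rate limiting: block any source IP exceeding the threshold within 5 minutes.
            "Name": "rate-limit",
            "Priority": 2,
            "Statement": {"RateBasedStatement": {"Limit": 2000, "AggregateKeyType": "IP"}},
            "Action": {"Block": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "rate-limit",
            },
        },
    ],
)
```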
Monitoring data clearly demonstrated the effectiveness of the new setup:
- Total requests: 1.63M requests hit our CMS during the observation period.
- Massive filtering: Our protection rules blocked over 854K malicious requests, effectively cutting the attack volume in half before it touched our servers.
Maintaining 100% service availability
The most significant difference was the stability of the services. During the first attack, our health checks dropped to zero. This time, our monitoring told a very different story.
Image 9. Health check status during the second attack
Proactive resilience: Telekom bug bounty
While this real-world event ended successfully, we don’t wait for attacks to test our defenses.
We actively participate in the Telekom Bug Bounty program (link), inviting white-hat hackers to continuously test our systems. This ensures vulnerabilities are identified by allies rather than adversaries.
This culture of continuous security testing enabled us to pivot quickly during the incident. We knew our authentication layers were strong — we simply had to reinforce the front door.