Our Content Management System (CMS) became an attack target

Last month, we didn’t just have a typical Monday. We experienced a real-world stress test. 

It started with a familiar sound: the alarm. Our front page was down, and the disruption was spreading to other services in our Kubernetes cluster. We were under a sustained Distributed Denial of Service (DDoS) attack. While these types of attacks are part of daily digital operations, this one was unusually large. 

The scale of the attack: What the numbers show 

This wasn't just a normal traffic spike:

  • Millions of requests: A massive volume of malicious traffic hammered our perimeter in under an hour.
  • 10x baseline surge: The request rate represented an order of magnitude increase over our normal operating baseline.
  • 1,024 (the hard limit): Our AWS Classic Load Balancer's surge queue hit its absolute ceiling, meaning the "waiting room" for our site was completely full.
  • 100% CPU saturation: CPU usage for our Ingress namespace exploded, pinning our pods at their limits as they fought to process the influx. 

What followed became a strong example of rapid response and cross-team collaboration. 

Service availability: When monitoring turned critical 

At the peak of the attack, our Amazon Route 53 health checks dropped to zero. This meant the backend systems were too overwhelmed to respond even to basic availability checks — resulting in temporary site-wide unavailability. 

Image 1. Route 53 health checks

The wall of traffic: Request volume surge 

The total request count hitting our infrastructure increased dramatically within a very short period. This sudden surge triggered our alerts and quickly exhausted available computing resources.

Image 2. Requests on Elastic Load Balancing (ELB)

The challenge: Shared entry points 

Our infrastructure runs on Amazon Elastic Kubernetes Service (EKS), using an Nginx Ingress Controller to manage incoming traffic. 

At the time, a single AWS Classic Load Balancer handled traffic to our front page. 

When the attack targeted our content management system (CMS), the traffic surge pushed our Ingress components to full processing capacity. Because other services — such as the customer portal — shared the same load balancer, the issue was not isolated to the CMS. The availability of the entire cluster was at risk. 

Our initial response followed standard procedure: we manually increased system replicas and relied on automatic scaling. However, the attack volume was too high for scaling alone to compensate. 
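
For illustration, the manual part of that response comes down to a single scale operation on the Ingress controller Deployment. The sketch below uses the Kubernetes Python client with placeholder names and replica counts, not our actual configuration:

```python
from kubernetes import client, config

# Minimal sketch of the manual scale-out (all names and counts are placeholders):
# raise the Ingress controller replica count by hand while the
# HorizontalPodAutoscaler keeps reacting to CPU load on top of that.
config.load_kube_config()
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="ingress-nginx-controller",  # placeholder Deployment name
    namespace="ingress-nginx",        # placeholder namespace
    body={"spec": {"replicas": 10}},  # manual headroom on top of autoscaling
)
```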

Importantly, our core products — Connect and Integrate, as well as Build and Operate — were not affected. This was possible because we follow strict Telekom security standards that separate front-end systems, backend systems, and customer environments at the network level.  

System saturation: Processing capacity reached its limit 

CPU metrics clearly showed 100% utilization in both the Nginx Ingress Controller namespace and the CMS namespace. Despite autoscaling, internal pods were completely saturated while attempting to process the malicious, human-like traffic patterns. 

Image 3. CPU usage in the Nginx Ingress Controller namespace

Image 4. CPU usage in the CMS namespace

User impact: Increased response times  

During the attack, average response times increased significantly. As systems struggled to handle the traffic load, users experienced noticeable delays and temporary service interruptions. 

This metric clearly illustrates the real-world impact before mitigation measures were fully implemented. 

Image 5. Average latency during the peak of the attack

The waiting room filled: Load balancer capacity reached 

The load balancer’s surge queue reached its maximum capacity of 1,024 requests. 

This queue acts as a temporary holding area for incoming traffic. Once full, additional requests could not be forwarded to the backend systems — confirming that the infrastructure had reached full saturation. 

Image 6. Classic ELB’s SurgeQueueLength metric

Automatic protection: Controlled rejections prevented collapse 

Once the queue limit was reached, additional requests were automatically rejected with “503 Service Unavailable” responses. 

Although this meant some users saw error messages, this mechanism served as a critical safety valve. It prevented the overload from spreading further and protected the Kubernetes cluster from a total system failure.  

Image 7. Each spike in the SpilloverCount metric represents requests that were automatically rejected
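
Both saturation signals are standard CloudWatch metrics on a Classic Load Balancer and can be pulled programmatically for alerting or post-incident analysis. The following boto3 sketch assumes a placeholder load balancer name:

```python
import boto3
from datetime import datetime, timedelta, timezone

# Sketch: read the two Classic ELB saturation signals from CloudWatch.
# "cms-classic-elb" is a placeholder load balancer name.
cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

for metric, stat in [("SurgeQueueLength", "Maximum"), ("SpilloverCount", "Sum")]:
    result = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB",
        MetricName=metric,
        Dimensions=[{"Name": "LoadBalancerName", "Value": "cms-classic-elb"}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,
        Statistics=[stat],
    )
    datapoints = sorted(result["Datapoints"], key=lambda p: p["Timestamp"])
    # A SurgeQueueLength maximum of 1,024 or any non-zero SpilloverCount means
    # requests are queuing up or being rejected before they reach the backends.
    print(metric, [point[stat] for point in datapoints])
```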

The strategic shift: Moving protection closer to the edge 

With support from the Telekom Cyber Emergency Response Team and AWS Enterprise Support, we reassessed our architecture under pressure. 

AWS identified the Classic Load Balancer as a bottleneck and recommended migrating to an Application Load Balancer combined with a Web Application Firewall (WAF). 

However, performing a full migration during an active incident was not feasible. We needed immediate protection. 

The solution: Amazon CloudFront 

We deployed a new CloudFront distribution and pointed our domain to it. Because the CMS serves dynamic content, the configuration had to be precise:

  • Cache Policy: Managed-CachingDisabled
  • HTTP Methods: Full suite allowed (GET, POST, PUT, DELETE, etc.) to maintain full CMS functionality
  • Security: AWS WAF enabled directly on CloudFront with Core Protections, rate limiting, and SQL injection defense
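
As a rough illustration, a distribution along these lines can be created with boto3 as sketched below; the origin domain, caller reference, and web ACL ARN are placeholders rather than our production values:

```python
import boto3

# Rough sketch of a CloudFront distribution for a dynamic CMS origin.
# The origin domain, caller reference and web ACL ARN are placeholders.
cloudfront = boto3.client("cloudfront")

cloudfront.create_distribution(
    DistributionConfig={
        "CallerReference": "cms-edge-protection-001",
        "Comment": "Edge protection for the CMS",
        "Enabled": True,
        "Origins": {
            "Quantity": 1,
            "Items": [{
                "Id": "cms-origin",
                "DomainName": "cms-origin.example.com",  # placeholder load balancer DNS name
                "CustomOriginConfig": {
                    "HTTPPort": 80,
                    "HTTPSPort": 443,
                    "OriginProtocolPolicy": "https-only",
                },
            }],
        },
        "DefaultCacheBehavior": {
            "TargetOriginId": "cms-origin",
            "ViewerProtocolPolicy": "redirect-to-https",
            # AWS managed "Managed-CachingDisabled" policy: every request is
            # forwarded to the origin, keeping the dynamic CMS behaviour intact.
            "CachePolicyId": "4135ea2d-6df8-44a3-9df3-4b5a84be39ad",
            # Full method set so editors can create, update and delete content.
            "AllowedMethods": {
                "Quantity": 7,
                "Items": ["GET", "HEAD", "OPTIONS", "PUT", "POST", "PATCH", "DELETE"],
                "CachedMethods": {"Quantity": 2, "Items": ["GET", "HEAD"]},
            },
        },
        # WAFv2 web ACL (Core rule set, rate limiting, SQLi protection) is
        # created separately and referenced here by its ARN.
        "WebACLId": "arn:aws:wafv2:us-east-1:111122223333:global/webacl/cms-web-acl/EXAMPLE",
    }
)
```

In practice an origin request policy (for example the managed Managed-AllViewer policy) would typically be attached as well, so that cookies, headers, and query strings still reach the CMS even though caching is disabled.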

Closing the back door

Initially, the attack did not stop. We observed that traffic was still hitting our load balancer directly — the attacker was bypassing CloudFront and targeting our public endpoints.

That was our “aha” moment. 

To resolve this, we implemented a handshake mechanism between CloudFront and our Ingress controller. Only traffic coming through CloudFront was allowed. 

The back door was locked. Any request attempting to bypass the WAF was immediately rejected by Nginx. Within minutes, traffic patterns flattened back to baseline. The attacker gave up.  
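
A common way to implement such a handshake is a shared-secret origin header: CloudFront attaches a custom header to every request it forwards, and the Nginx Ingress Controller rejects anything that arrives without it. The sketch below assumes that approach; all names and values are placeholders:

```python
from kubernetes import client, config

# Sketch of the Ingress side of a shared-secret handshake: CloudFront is assumed
# to attach an "X-Origin-Verify" custom header to every request it forwards to
# the origin, and the Nginx Ingress Controller rejects requests that lack it.
# The Ingress name, namespace, header name and secret are placeholders.
VERIFY_SNIPPET = """
if ($http_x_origin_verify != "REPLACE_WITH_SHARED_SECRET") {
    return 403;
}
"""

config.load_kube_config()
networking = client.NetworkingV1Api()

networking.patch_namespaced_ingress(
    name="cms",       # placeholder Ingress name
    namespace="cms",  # placeholder namespace
    body={"metadata": {"annotations": {
        # ingress-nginx injects this snippet into the generated location config
        "nginx.ingress.kubernetes.io/configuration-snippet": VERIFY_SNIPPET,
    }}},
)
```

The matching half of the handshake sits on the CloudFront distribution as a custom origin header (the Origins[].CustomHeaders setting); depending on the ingress-nginx version, snippet annotations may also need to be explicitly enabled.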

The resilient test: A second attack 

A few weeks after the initial incident, our systems were targeted again by a similar large-scale DDoS attack. 

This was the real test. Would our new architecture withstand the same pressure that had previously disrupted our CMS?  

By this time, AWS CloudFront and AWS WAF were fully operational, and the infrastructure handled the surge exactly as designed. Malicious traffic was identified and blocked at the network edge, long before it could reach our Kubernetes cluster.

Image 8. AWS WAF activity during the follow-up attack

Monitoring data clearly demonstrated the effectiveness of the new setup:

  • Total requests: 1.63M requests hit our CMS during the observation period.
  • Massive filtering: Our protection rules blocked over 854K malicious requests, effectively cutting the attack volume in half before it touched our servers. 

Maintaining 100% service availability 

The most significant difference was the stability of the services. During the first attack, our health checks dropped to zero. This time, our monitoring told a very different story:

Image 9. Health check status during the second attack

Proactive resilience: Telekom bug bounty 

While this real-world event ended successfully, we don’t wait for attacks to test our defenses. 

We actively participate in the Telekom Bug Bounty program (link), inviting white-hat hackers to continuously test our systems. This ensures vulnerabilities are identified by allies rather than adversaries. 

This culture of continuous security testing enabled us to pivot quickly during the incident. We knew our authentication layers were strong — we simply had to reinforce the front door. 
