ECC SWAT

Problem Statement

CodeRed found and reported the following incidents, which compromised the availability of its website and the experience of its users:

  • The CodeRed application was consuming very high CPU and memory on the server
  • CPU utilization exceeded 70 percent on the VMs
  • Accessibility of the ECC-CodeRed website was compromised
  • Hardware resource utilization on the servers peaked abruptly without a significant increase in concurrent users

Bellurbis was tasked with investigating the issues, producing a root cause analysis, and optimizing the application to safeguard against such incidents in the future.

Our Solution

Bellurbis analyzed and monitored the application holistically (application logs, DB logs, system logs, server configuration, resource utilization history, etc.) to understand the impact of each incident both on the system as a whole and on individual components, and to find the root cause. This exercise was done to ensure that the different components work together seamlessly to handle the user load and provide a smooth, consistent user experience throughout.

Bellurbis studied the incidents reported by CodeRed and optimized the website in the following ways:

  1. Audited the application's infrastructure
  2. Monitored system resource utilization
  3. Analyzed logs across the whole application
  4. Analyzed and segregated the issues by mapping similarities and differences between incident occurrences
  5. Continuously and closely monitored KPIs
  6. Analyzed historical data from and around each incident to reconstruct a picture of the activity across the complete environment at the time, in order to better understand the incident
  7. Analyzed and mapped the pattern of similar incidents and predicted the next occurrence, so that the issue could be captured as it happened
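
Step 7 can be illustrated with a small sketch. The case study does not say how the next occurrence was predicted, so the approach here (adding the median gap between past incidents to the most recent one) is only an assumption, and the function name `predict_next_incident` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def predict_next_incident(timestamps):
    """Estimate the next incident time from the median gap between past incidents."""
    if len(timestamps) < 2:
        raise ValueError("need at least two incidents to estimate an interval")
    ordered = sorted(timestamps)
    # Gap in seconds between each pair of consecutive incidents.
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return ordered[-1] + timedelta(seconds=median(gaps))


# Hypothetical incident times, for illustration only (not from the case study).
incidents = [
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 2, 2, 5),
    datetime(2024, 1, 3, 1, 55),
]
print(predict_next_incident(incidents))
```

An estimate like this only gives a window in which to intensify monitoring; it is useful when incidents recur at roughly regular intervals, as with scheduled jobs.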

Apart from identifying the root cause of the incidents and their solutions, Bellurbis offered the following suggestions to make future root cause analysis simpler:

  • Keep all types of logs (API logs, DB query logs, etc.) for at least 2 weeks
  • Implement monitoring tools (such as Nagios) to capture a snapshot of all the processes running at the moment resource utilization peaks
  • Use an Application Performance Monitoring (APM) tool
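
The process-snapshot suggestion can be sketched as follows. This is not a Nagios plugin and not the tooling used in the engagement; it is a minimal illustration that assumes a Linux host where `ps -A -o pid,pcpu,pmem,comm` is available, and that the overall CPU figure is supplied by whatever monitoring tool is in place. The 70 percent threshold comes from the problem statement; the function names, the sample output, and the process names in it are hypothetical.

```python
import subprocess

CPU_THRESHOLD = 70.0  # percent; the level reported in the problem statement


def parse_ps(ps_output):
    """Parse `ps -A -o pid,pcpu,pmem,comm` output into a list of dicts."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, cpu, mem, command = parts
        rows.append({"pid": int(pid), "cpu": float(cpu),
                     "mem": float(mem), "command": command})
    return rows


def snapshot_if_busy(total_cpu_percent, ps_output, threshold=CPU_THRESHOLD, top=5):
    """Return the top CPU consumers when utilization is above the threshold."""
    if total_cpu_percent <= threshold:
        return []
    rows = parse_ps(ps_output)
    return sorted(rows, key=lambda row: row["cpu"], reverse=True)[:top]


def read_ps():
    """Collect the live process table (assumes a Linux host with procps `ps`)."""
    result = subprocess.run(["ps", "-A", "-o", "pid,pcpu,pmem,comm"],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical `ps` output, for illustration only.
SAMPLE = """\
  PID %CPU %MEM COMMAND
    1  0.0  0.1 systemd
  812 64.5 12.3 app-worker
  977 21.0 30.2 db-server
"""

for row in snapshot_if_busy(85.5, SAMPLE):
    print(row)
```

In practice the snapshot would be written to a timestamped file each time the threshold is crossed, so that it can later be correlated with the retained API and DB query logs.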