Reliable support & service level agreement from IT Svit
A Service Level Agreement (SLA) is a key component of any IT services contract. If you hire another company to perform some tasks, you want to make sure there is a precise description of the types of possible issues, a detailed list of incident response times, and an understandable, trackable issue resolution procedure in place.
IT Svit works under an SLA that differs considerably from typical IT service level agreements. The main difference is that we have no service caps, meaning we do not limit the number of incidents resolved monthly, nor do we specify service credits for this reason. Instead, we apply our ample DevOps expertise to audit the infrastructure and workflows in place, identify the bottlenecks that hinder growth and negatively impact the stability of operations, and proactively remove them. This way, we minimize the number of incidents, ensuring the stability of your operations and decreasing the workload for our DevOps engineers, who support multiple projects each.
The goal of our Service Level Agreement is to establish a reliable partnership between IT Svit and our customers and describe the provided services, supported infrastructure components, notification and alerting methods, as well as the KPIs for every project we perform.
Under the SLA, IT Svit has the following rights:
- to require the customer to follow the instructions and procedures we enact;
- to demand compensation for the work needed to fix issues caused by third-party interference in operations;
- to select the project team composition and plan their work schedules ourselves.
In turn, IT Svit has the responsibilities to:
- guarantee timely processing and resolution of incidents and requests covered by this SLA
- achieve the service KPIs listed below and ensure stable operations of covered services
- employ third parties if needed to provide the services covered by this SLA
The customers have the following right:
- to monitor the process of service delivery in a way that does not impede the team’s productivity
In turn, the customers have the following responsibilities:
- to provide all the documentation and login details required for service delivery
- to issue tickets regarding any incidents they encounter or requests they want to make
System components covered by service level agreement from IT Svit
While every business is different and almost every cloud infrastructure is unique, they are all built from similar building blocks: code repositories, file storage, databases, server instances, Docker containers, etc. Below is the list of common system components and tools covered by the IT Svit SLA, all of which can be used for both staging and production servers.
- Cloud platform or bare-metal servers (AWS, GCP, DO, Azure, OpenStack, OpenShift, etc)
- Container management services (EKS, GKE, ECS, etc)
- Virtual machine instances (Amazon EC2, Google Compute Engine, etc)
- File storage (Amazon S3, Google Storage Buckets, etc)
- SSH key management tools
- Virtual Private Clouds (Amazon VPC, Google VPC, etc)
- VPN instances
- NAT instances
- API connectors
- Web or mobile apps
- Databases (Amazon RDS, PostgreSQL, MySQL, MongoDB, Redis, Cassandra, etc)
- Docker Registry for images
- Jenkins cron jobs for infrastructure management tasks
- Monitoring tools (ELK stack, Prometheus & Grafana, Splunk, SumoLogic, etc)
- Nginx ingress controllers
- TLS/SSL certificate managers
- HashiCorp Vault for secret storage
- Code repositories and CI/CD tools (GitHub, GitLab, CircleCI, GitLab CI, etc)
These are the basic components present in nearly any infrastructure. The rest of the modules depend on the project, and we work with a huge variety of cloud-specific and open-source instruments to build resilient, scalable and manageable infrastructures.
Workflows covered by IT Svit service level agreement
We ensure stable operations for multiple tasks and workflows, such as:
- vertical and horizontal scaling
- VPN user management
- various application operations
- database management (backup & restore, sharding, replication, etc)
- API operations (connect, deploy, check logs, etc)
- web/mobile app operations (connect, deploy, check logs, etc)
- various operations required based on your project specifics
That said, IT Svit also has ample experience with software development, Big Data analytics, blockchain development, Machine Learning model training, etc. In such cases, the IT service level agreement is adjusted to reflect the modules and workflows involved.
Services provided under our SLA
This is by no means an exhaustive list of common incidents and our responses to them. Particular operations in this list will be replaced according to the needs of your project.
Environment | Object | State | Observation | SLA | Priority |
Production | Web Application | Login page is not accessible (unsuccessful probe, timeout, etc.) | Prometheus BlackBox exporter | > 1% during 1 minute | Critical |
Staging | Web Application | Login page is not accessible (unsuccessful probe, timeout, etc.) | Prometheus BlackBox exporter | > 1% during 1 minute | High |
Production | Web Application | Certificate expiring | Prometheus BlackBox exporter | < 3 days | Critical |
Staging | Web Application | Certificate expiring | Prometheus BlackBox exporter | < 3 days | High |
Production | Web Application | 5xx response code rate | Monitor Nginx ingress controller logs for response codes | > 1% during 1 minute | Critical |
Staging | Web Application | 5xx response code rate | Monitor Nginx ingress controller logs for response codes | > 1% during 1 minute | High |
Production | Web Application | 2xx response time | Monitor Nginx ingress controller logs for response codes | > 1000 ms during 1 minute | Critical |
Staging | Web Application | 2xx response time | Monitor Nginx ingress controller logs for response codes | > 1000 ms during 1 minute | High |
Production | Web Application | CPU Utilization | Prometheus | > 90% during 5 minutes | High |
Staging | Web Application | CPU Utilization | Prometheus | > 90% during 5 minutes | Medium |
Production | Web Application | Memory Utilization | Prometheus | > 80% during 5 minutes | High |
Staging | Web Application | Memory Utilization | Prometheus | > 80% during 5 minutes | Medium |
Production | Web Application | Disk IO | Prometheus | > 100ms during 5 minutes | High |
Staging | Web Application | Disk IO | Prometheus | > 100ms during 5 minutes | Medium |
Production | API Application | API is not accessible (unsuccessful probe, timeout, etc.) | Prometheus BlackBox exporter | > 1% during 1 minute | Critical |
Staging | API Application | API is not accessible (unsuccessful probe, timeout, etc.) | Prometheus BlackBox exporter | > 1% during 1 minute | High |
Production | API Application | Certificate expiring | Prometheus BlackBox exporter | < 3 days | Critical |
Staging | API Application | Certificate expiring | Prometheus BlackBox exporter | < 3 days | High |
Production | API Application | 5xx response code rate | Monitor Nginx ingress controller logs for response codes | > 1% during 1 minute | Critical |
Staging | API Application | 5xx response code rate | Monitor Nginx ingress controller logs for response codes | > 1% during 1 minute | High |
Production | API Application | 2xx response time | Monitor Nginx ingress controller logs for response codes | > 1000 ms during 1 minute | Critical |
Staging | API Application | 2xx response time | Monitor Nginx ingress controller logs for response codes | > 1000 ms during 1 minute | High |
Production | API Application | CPU Utilization | Prometheus | > 90% during 5 minutes | High |
Staging | API Application | CPU Utilization | Prometheus | > 90% during 5 minutes | Medium |
Production | API Application | Memory Utilization | Prometheus | > 80% during 5 minutes | High |
Staging | API Application | Memory Utilization | Prometheus | > 80% during 5 minutes | Medium |
Production | API Application | Disk IO | Prometheus | > 100ms during 5 minutes | High |
Staging | API Application | Disk IO | Prometheus | > 100ms during 5 minutes | Medium |
Production | Database | 5432 port is not accessible | | > 1% during 1 minute | Critical |
Staging | Database | 5432 port is not accessible | | > 1% during 1 minute | High |
Production | Database | CPU Utilization | Prometheus | > 90% during 5 minutes | High |
Staging | Database | CPU Utilization | Prometheus | > 90% during 5 minutes | Medium |
Production | Database | Memory Utilization | Prometheus | > 80% during 5 minutes | High |
Staging | Database | Memory Utilization | Prometheus | > 80% during 5 minutes | Medium |
Production | Database | Disk IO | Prometheus | > 100ms during 5 minutes | Critical |
Staging | Database | Disk IO | Prometheus | > 100ms during 5 minutes | High |
Production | Redis | 6379 port is not accessible | | > 1% during 1 minute | Critical |
Staging | Redis | 6379 port is not accessible | | > 1% during 1 minute | High |
Production | Redis | CPU Utilization | Prometheus | > 90% during 5 minutes | High |
Staging | Redis | CPU Utilization | Prometheus | > 90% during 5 minutes | Medium |
Production | Redis | Memory Utilization | Prometheus | > 80% during 5 minutes | High |
Staging | Redis | Memory Utilization | Prometheus | > 80% during 5 minutes | Medium |
Production | Redis | Disk IO | Prometheus | > 100ms during 5 minutes | High |
Staging | Redis | Disk IO | Prometheus | > 100ms during 5 minutes | Medium |
Production | Vault | Cluster degraded | Logs | > 1 min | Critical |
Staging | Vault | Cluster degraded | Logs | > 1 min | High |
Production | Vault | 8200 port is not accessible | | > 1 min | Critical |
Staging | Vault | 8200 port is not accessible | | > 1 min | High |
Production | VPN | Developers don’t have access to the VPN instance | Google Virtual Private Cloud monitoring | > 40% during 1 day | High |
Staging | VPN | Developers don’t have access to the VPN instance | Google Virtual Private Cloud monitoring | > 40% during 1 day | Medium |
Maintenance request | Infrastructure components | Existing infrastructure functionality needs to be used to make changes without altering current behavior. | – | – | Medium |
Change request | Infrastructure components | Infrastructure changes or new functionality are needed that alter current behavior. | – | – | Low |
Information request | Infrastructure and application components | Additional information about the system or its behavior is needed. | – | – | Low |
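To illustrate how a row such as "> 1% during 1 minute" might be turned into an incident, here is a minimal Python sketch. It is not IT Svit's actual tooling: the `ThresholdRule` class, the `is_breached` function and the sample values are hypothetical. It also assumes that "during" means the value stays above the threshold for the entire window, which is similar in spirit to the `for` clause of a Prometheus alerting rule. It only covers the "greater than" rows, not the certificate expiry rows.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ThresholdRule:
    """One 'greater than X during Y' row of the table (hypothetical model)."""
    environment: str
    obj: str
    state: str
    threshold: float      # same unit as the samples (%, ms, ...)
    window_seconds: int   # the "during N minutes" part of the SLA column
    priority: str


def is_breached(rule: ThresholdRule, samples: Sequence[Tuple[float, float]]) -> bool:
    """Return True if every sample in the trailing window exceeds the threshold.

    `samples` holds (timestamp_seconds, value) pairs sorted by timestamp.
    Assumption: a rule fires only when the data covers the whole window
    and no sample inside it is at or below the threshold.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - rule.window_seconds
    if samples[0][0] > window_start:
        return False  # not enough history to cover the full window
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > rule.threshold for value in recent)


rule = ThresholdRule("Production", "Web Application", "5xx response code rate",
                     threshold=1.0, window_seconds=60, priority="Critical")

# Illustrative samples: 5xx rate in percent, collected every 15 seconds.
sustained = [(0, 1.5), (15, 2.0), (30, 1.2), (45, 3.0), (60, 1.1)]
recovered = [(0, 1.5), (15, 2.0), (30, 0.4), (45, 3.0), (60, 1.1)]

for name, series in (("sustained", sustained), ("recovered", recovered)):
    status = f"open {rule.priority} incident" if is_breached(rule, series) else "no incident"
    print(f"{name}: {status}")
```

Requiring the breach to persist across the whole window keeps a single short spike from opening an incident, which is one plausible reason the table pairs every threshold with a duration.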
IT Svit service level agreement — guaranteed performance for your business!
IT Svit provides reliable support for all kinds of IT operations, from software development and database administration to Big Data analytics, blockchain development, and Artificial Intelligence algorithms. We back our promises up with an in-depth service level agreement, and we do not aim to do the least amount of work to get paid. Instead, we aim to go the extra mile to remove the bottlenecks and minimize the number of incidents, which benefits all parties involved. Sounds too good to be true? Get in touch and see for yourself!