Business Insights & Data Analytics Platform
We were contacted by a co-founder and COO of a real-time Big Data analytics platform that uses Artificial Intelligence to help businesses get the most out of their data. The product extracts valuable insights from the large volumes of data a business has access to.
Location: Los Angeles, CA
Industry: Business Intelligence and Big Data analytics
Partnership period: December 2018 — ongoing
Team size: 1 team lead, 2 DevOps engineers, 1 Big Data engineer
Team location: Kharkiv, Ukraine
Services: Cloud infrastructure design and optimization, database performance optimization, Machine Learning model training, monitoring implementation, CI/CD implementation
Expertise delivered: App containerization and container management, ML training and management, cloud infrastructure management, monitoring implementation, CI/CD
Technology stack: AWS, Coordinator, Alertmanager, G Suite, GitHub, Jenkins, Kibana, OpenVPN, Prometheus, Spinnaker, Superset.
Project requirements
The customer has a pretty large backend Big Data system running on AWS infrastructure. They needed IT Svit to help with the following tasks:
- Planning and performing the upgrade for their Druid instance
- Implementing enhanced system monitoring and alerting with Amazon CloudWatch, Prometheus, Grafana and Alertmanager
- Configuring a staging environment. They needed access to expertise with Kubernetes, Ansible, Docker, Apache Kafka, Hadoop, Druid, Postgres, Node, Alertmanager, Prometheus and Grafana.
- Planning and building a CI/CD pipeline using Jenkins to improve the insights processing pipeline
Results
IT Svit accomplished all the tasks in order of importance:
- We moved all the apps from Marathon to Amazon EKS and optimized the system architecture. As a result, Druid cluster upgrade time dropped fourfold, from 1 hour to 15 minutes.
- We found and resolved the issue that caused application workers to restart frequently on legacy clusters.
- We configured the CI/CD pipeline using Jenkins to automate daily cluster operations.
- We connected Spinnaker to Prometheus using Alertmanager. Abnormal system behavior now triggers an immediate alert in Slack and is resolved much faster.
Challenges and solutions
The analytics company faced several challenges that hindered product growth and reduced the stability of operations:
- Marathon infrastructure was not able to deal with the ever-growing number of requests
- Some Big Data input sources were slow and dragged down the whole platform, and the Druid version was badly outdated
- There was little to no insight into system performance due to the complexity of the backend structure
We solved these issues successfully:
- Set up Airflow for Data Analysis automation.
- Updated Druid and Spinnaker to the latest versions.
- Set up the EKS cluster and moved the majority of the customer's apps from Marathon to Amazon EKS.
- Installed and configured Jenkins pipelines for automated application updates.
- Set up monitoring and notifications using Prometheus, Grafana and AlertManager.
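To illustrate the alerting path described above (Prometheus alerts routed through Alertmanager and delivered to Slack), here is a minimal Python sketch of how an Alertmanager-style webhook payload could be turned into a Slack message body. It is an illustration only, not the customer's actual setup: the case study does not say how the Slack integration was configured, and Alertmanager can also post to Slack through its built-in Slack receiver with no custom code. The function name, the alert name and the example payload are hypothetical.

```python
import json


def format_slack_message(payload: dict) -> dict:
    """Build a Slack incoming-webhook body from an Alertmanager-style payload.

    Assumes each alert carries `status`, `labels` and `annotations` keys,
    as in Alertmanager's webhook notifications. Missing fields fall back
    to neutral defaults instead of raising.
    """
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        severity = labels.get("severity", "none")
        summary = annotations.get("summary", "")
        # One line per alert; rstrip drops the trailing space when summary is empty.
        lines.append(f"[{status}] {name} (severity: {severity}) {summary}".rstrip())
    # Slack incoming webhooks accept a JSON object with a "text" field.
    return {"text": "\n".join(lines)}


# Hypothetical payload, invented for illustration only.
example_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "DruidWorkerRestarts", "severity": "critical"},
            "annotations": {"summary": "Workers restarting frequently"},
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(format_slack_message(example_payload), indent=2))
```

In a real deployment the resulting body would be sent with an HTTP POST to a Slack incoming-webhook URL. The sketch leaves that step out so it stays runnable without credentials.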
Conclusions
By updating Druid and Spinnaker, moving from Marathon to EKS, resolving the issues with Jenkins and implementing in-depth monitoring with smart alerting, IT Svit helped the customer ensure that the systems operate at top efficiency, updates roll out seamlessly, and any incidents are reported instantly and resolved quickly.