Big Data system architecture design from IT Svit

If your business wants to leverage Big Data analytics to provide more value to your customers or optimize your infrastructure management expenses, selecting the most appropriate Big Data analytics architecture is crucial. This requires an in-depth understanding of the best practices of Big Data system design based on experience with previous successfully developed Big Data solutions. IT Svit is ready to lend our expertise and help you build cost-efficient and highly-performant Big Data system architecture.

Managing cloud-based Big Data solutions

Cloud vendors like AWS and GCP offer Big Data solutions as a service to their customers, Amazon Kinesis for instance. The catch is that without in-depth knowledge of these services, you risk overpaying or being unable to use these tools to their full potential. IT Svit helps businesses get the most out of their cloud computing investments and Big Data solutions.

Building cloud infrastructure for your Big Data analytics

The key component of a successful Big Data strategy is a resilient cloud infrastructure powering it. Building such an infrastructure and aligning it with your business goals is essential for reaching the results you want to achieve. IT Svit provides dedicated teams with extensive Big Data and DevOps expertise. We can help your business use your cloud infrastructure at full efficiency!

Ready to start?

When the time comes to implement Big Data analytics in your business workflows, the main question is not “how?” or “what for?”. It is “where?”. Whether your Big Data solution is aimed at delivering more value to your end users or at minimizing your cloud infrastructure OPEX, it has to be cost-efficient, able to withstand the huge volume and velocity of Big Data, and able to provide the expected results quickly. In other words, you need to have the right Big Data architecture.

What does the correct Big Data design involve? It must include the following parts:

  • robust disk space resources for handling the volume needed for Big Data storage
  • ample CPU and RAM to keep up with the velocity at which Big Data analytics works
  • a correctly chosen Machine Learning model to process this data and deliver the results in near real time

Based on IT Svit's experience, the best way to design and build Big Data architecture is to identify the project needs correctly and implement the Big Data solutions that will meet these needs and provide the results you expect. However, sometimes the best decision is to avoid deploying a Big Data solution at all, and below we explain why.

There are cases where the influx of data is really big, but the data itself is homogeneous, without much variety of data types. For example, a wind farm might collect data about wind speed, direction, air temperature, motor temperature of turbines, and several other parameters, but all of them are accounted for, understandable and predictable. In that case, the readings of most of these sensors can be disregarded and treated as a single data input point for as long as they are all transmitting parameter values inside the permitted range.

However, if some parameters begin to exceed the allowed range of values (the temperature is rising drastically at some motor sensor, for instance, or some outward sensor indicates an incoming gust of wind with gravel in it), the scripts detect this anomaly and apply one of the predefined response scenarios: a motor is stopped for investigation, or the whole wind farm rotates its blades to minimize the impact of the gravel. While this setup requires fairly complex on-site edge computing technology and a centralized cloud computing control system, its behavior is well described by mathematical equations and Boolean logic, so deploying a full-scale Big Data solution there is not cost-efficient.
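The range checks and predefined responses described above can be sketched in a few lines of Python. This is only an illustration of why plain Boolean logic is enough here: the parameter names, the permitted ranges and the response names are all hypothetical and would come from the equipment specifications in a real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    low: float
    high: float
    response: str  # predefined response scenario to apply on violation


# Hypothetical permitted ranges, not real turbine specifications.
LIMITS = {
    "motor_temp_c": Limit(-20.0, 90.0, "stop_motor"),
    "wind_speed_ms": Limit(0.0, 25.0, "feather_blades"),
}


def check_reading(reading: dict) -> list:
    """Return the response scenarios triggered by one sensor reading."""
    responses = []
    for name, limit in LIMITS.items():
        value = reading.get(name)
        if value is None:
            continue  # parameter not reported in this reading
        if not (limit.low <= value <= limit.high):
            responses.append(limit.response)
    return responses
```

While every value stays inside its range, `check_reading` returns an empty list and the reading needs no further processing, which is why no Machine Learning model is required for this kind of monitoring.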

On the other hand, a customer might need to verify product users by their IDs to ensure they are of legal age, for example. A photo-quality scan or picture of an ID is needed in such a case, as well as the user’s photo. That means either a living person (a support representative) must compare these images in order to verify a user, or a Machine Learning algorithm must be deployed.

Such a system must use one of the Optical Character Recognition (OCR) algorithms, which are offered by both Amazon Web Services and Google Cloud Platform. These algorithms can be trained to analyze the content of any image and identify photos, text, numerical symbols, special symbols, etc. If the model validates that the user’s ID photo matches the user’s own photo, and that the ID is a government-issued document, like a driver’s license or a passport, the user can be verified and use the system. IT Svit has experience with enabling this feature for a dating service with millions of active users.

To achieve this result, we started with a simple data warehouse to obtain enough data for training a Machine Learning model. There are billions of people’s faces on Shutterstock and other resources, as well as officially published templates for all government-issued IDs. Data scientists from IT Svit combined all of this information into a huge data lake and applied several OCR algorithms to determine the most suitable one. We had to manually impose certain starting points, like the grid of lines on a document that identifies it as a passport or a driver’s license, so the model can look for these grids to identify the documents as IDs.

We also had to teach the model the range of values that can be present in the corresponding fields of these documents. For example, US driver’s licenses can be issued by one of the 50 states or the District of Columbia, so we added these 51 abbreviations to the model. If passports are generally issued at age 16, then as of 2019 no birth date in a passport can be later than 2003, etc. We provided lots of such basic points, which effectively reinforced the model’s learning, shortened the product’s time to market by approximately 25% and lowered the model training expenses by at least 50%.
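Field constraints of this kind can be expressed as simple validation rules. The sketch below is an assumption about how such rules might look, not the code used in the project: the function names are hypothetical, and the minimum age of 16 is taken from the example above.

```python
from datetime import date

# Postal abbreviations of the 50 US states plus the District of Columbia.
US_ISSUERS = frozenset(
    "AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME "
    "MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI "
    "SC SD TN TX UT VT VA WA WV WI WY".split()
)

MIN_HOLDER_AGE = 16  # assumption taken from the example in the text


def latest_birth_year(today: date, min_age: int = MIN_HOLDER_AGE) -> int:
    """Latest birth year a holder of at least min_age can have."""
    return today.year - min_age


def plausible_fields(issuer: str, birth_year: int, today: date) -> bool:
    """Check that recognized field values fall inside the allowed ranges."""
    known_issuer = issuer.strip().upper() in US_ISSUERS
    old_enough = birth_year <= latest_birth_year(today)
    return known_issuer and old_enough
```

Rejecting implausible field values with rules like these narrows what the model has to learn from data alone.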

The data was stored with Amazon Web Services and the initial data processing was performed by Amazon Kinesis. However, we later found this too costly and deployed our custom-tailored Machine Learning model as a Docker container on a Kubernetes cluster, along with the rest of the system components hosted on Amazon EC2. Now, this algorithm runs on an Amazon EC2 Spot Instance, saving up to 90% on maintenance costs.

Another customer, running a distributed financial application, wanted IT Svit to design a Big Data solution for predictive analytics in order to optimize the performance of its complex cloud infrastructure. We started by forming a data lake that included all the machine-generated data: system logs, customer support tickets, incident reports, etc. Then we deployed and trained a Deep Learning Network that identified normal operational patterns, and we wrote several response scenarios for the most frequent service interruption cases.

For example, when the number of visitors and the CPU load grow rapidly, the system launches additional application instances to reduce the average CPU load to appropriate levels. Once the peak load is over, the excess instances are immediately shut down to prevent overspending.
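A scaling scenario of this kind boils down to a small calculation. The following sketch uses a generic target-tracking rule; the target CPU level and the instance limits are assumed values for illustration, not the settings used in the project.

```python
import math


def desired_instances(
    current: int,
    avg_cpu: float,
    target_cpu: float = 0.5,   # assumed target average CPU load (0..1)
    min_instances: int = 1,    # assumed lower bound
    max_instances: int = 20,   # assumed upper bound that caps spending
) -> int:
    """Return the instance count that brings average CPU load near the target."""
    if current < 1:
        raise ValueError("current must be at least 1")
    if not 0.0 < target_cpu <= 1.0:
        raise ValueError("target_cpu must be in (0, 1]")
    # Total load is current * avg_cpu; spread it so each instance sits at target.
    wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(min_instances, min(max_instances, wanted))
```

With these assumed defaults, four instances at 75% average load scale out to six, and six instances at 25% scale back in to three once the peak is over.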

After this, the Big Data analytics solution went live and reduced the average time-to-recovery for various issues by nearly 80%, as all the incidents were processed in near real time. It also lowered the OPEX by at least 30%, as there was no unneeded resource usage or service downtime any more.

Over the course of 5+ years of working with Big Data analytics, IT Svit was able to gather significant expertise with Big Data processing and design of Big Data solutions for various uses. We field experienced data scientists who are able both to augment your product with Machine Learning and to improve your operational efficiency through predictive Big Data analytics.

Configuring tools like Amazon EMR, Google AI and Amazon Kinesis for your needs

IT Svit has been involved in plenty of projects where our customers went for cloud-based Big Data solutions from AWS or GCP, like Amazon Kinesis, Amazon EMR or Google AI engine. These products are positioned as “easy to use”, and indeed they are easy to set up and run, for skilled data scientists, that is. We have joined active projects to reconfigure their Big Data analytics modules, and we were able to do everything correctly. We can help you get the most out of your cloud computing investments by configuring the Machine Learning systems correctly.

We build reliable cloud infrastructure for Big Data analytics

Not a single Big Data solution can work well if it is deployed to a flawed infrastructure. A true Big Data architect must also be a good DevOps system architect to be able to use various cloud services and DevOps tools to ensure the stable performance of a Big Data analytics system. IT Svit fields experienced Big Data enterprise architecture experts, who can design and deploy a system of any complexity while also ensuring it is both resilient and cost-efficient.

IT Svit is experienced in building reliable Big Data architecture and workflows for your business. Should you need help with enabling Big Data analytics, contact us. We are always ready to assist!

Contact Us


