The types of Big Data tools IT Svit uses
While Big Data remains an umbrella buzzword for a range of productive, easily scalable and cost-efficient tools and solutions, its true meaning depends on who uses the term and for what reason.
That said, we decided to describe the types of Big Data tools IT Svit uses.
As we have already mentioned in one of our previous articles, Big Data is described by three V's:
- Volume
- Variety
- Velocity
Most of the projects we were involved in dealt with huge amounts of textual information on the Internet and its optimization, so each of the three V's shaped our toolkit:
- Volume: we had to process, store and manage large flows of incoming data, which led us to work with reliable, scalable databases;
- Variety: we had to gather and process data from many different sources, which led to decentralized web crawling and various other content analysis techniques;
- Velocity: we had to provide high-speed data processing, which resulted in asynchronous Python multiprocessing and queueing with RabbitMQ or SQS to build easily scalable, high-performance pipelines.
Big Data databases: Redis, Cassandra, MongoDB
Redis worked well for us as an in-memory key-value database used to deliver a decentralized queue for analyzing textual content. This enabled our projects to gather data for further processing.
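Here is a minimal sketch of this pattern, assuming the redis-py client; the queue name, connection details and the analyze() step are illustrative, not our production code:

```python
import redis

# Connect to a Redis instance (host/port are illustrative).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

QUEUE_KEY = "crawl:pages"  # hypothetical queue name

def enqueue_page(url: str) -> None:
    """Producer side: push a URL onto the shared work queue."""
    r.lpush(QUEUE_KEY, url)

def worker() -> None:
    """Consumer side: block until a URL arrives, then analyze it."""
    while True:
        # BRPOP blocks while the queue is empty, so idle workers cost
        # nothing; any number of worker processes on any host can pull
        # from the same list, which is what makes the queue decentralized.
        _, url = r.brpop(QUEUE_KEY)
        analyze(url)

def analyze(url: str) -> None:
    # Placeholder for the actual textual-content analysis.
    print(f"analyzing {url}")
```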
Cassandra, a well-known and proven choice for storing and managing assorted data (such as historical data within some range, say telemetry), met our expectations and provided great fault tolerance, easily handling the exceptional I/O workloads generated by our project's algorithm. This was possible due to its built-in sharding capabilities, and Cassandra delivered excellent results in storing and processing the flow of data, scaling easily as the need arose.
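As a hedged sketch of how such a telemetry table could be modeled with the DataStax cassandra-driver; the keyspace, table and column names here are illustrative choices, not the actual project schema:

```python
from datetime import date, datetime
from cassandra.cluster import Cluster

# Contact points are illustrative; in production these would be the
# cluster's seed nodes.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# The composite partition key (sensor_id, day) keeps every partition
# bounded in size while letting Cassandra spread the write load across
# the ring; this is the built-in sharding mentioned above.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.telemetry (
        sensor_id text,
        day       date,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

insert = session.prepare(
    "INSERT INTO demo.telemetry (sensor_id, day, ts, value) VALUES (?, ?, ?, ?)"
)
session.execute(insert, ("sensor-42", date.today(), datetime.utcnow(), 21.7))
```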
MongoDB, a document-oriented database, served our need to store assorted data for smaller projects; that data was later easily processed and exchanged with the rest of the databases. Its scalability and flexibility helped us deal with querying and indexing of the data sets.
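For illustration, a minimal pymongo sketch of this store-index-query flow; the database, collection and field names are hypothetical:

```python
from pymongo import ASCENDING, MongoClient

# Connection string is illustrative.
client = MongoClient("mongodb://localhost:27017")
articles = client["content_db"]["articles"]  # hypothetical names

# A compound index covers the queries we run most often.
articles.create_index([("site", ASCENDING), ("fetched_at", ASCENDING)])

# Documents are schemaless, so records from different sources can
# carry different fields in the same collection.
articles.insert_one({
    "site": "example.com",
    "url": "https://example.com/post/1",
    "fetched_at": "2019-03-01T12:00:00Z",
    "text": "raw page text goes here",
})

# Query by the indexed fields.
for doc in articles.find({"site": "example.com"}).limit(10):
    print(doc["url"])
```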
Asynchronous applications
One of the issues we face while processing the datasets is the ever-growing number of concurrent events our applications have to handle. As our language of choice is Python, both multithreading and multiprocessing are needed to handle this situation. That said, we went for an asynchronous I/O architecture for our applications to ensure the stability and continuity of our microservices. The other part of the solution was configuring message brokers.
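As a simplified illustration of the asynchronous approach, using only the standard-library asyncio module (the sleep stands in for a real network or database call):

```python
import asyncio

async def handle_document(doc_id: int) -> str:
    """One I/O-bound processing step for a single document."""
    await asyncio.sleep(0.1)  # stand-in for awaiting real I/O
    return f"processed {doc_id}"

async def main() -> None:
    # Thousands of concurrent I/O-bound tasks fit in one process,
    # because each coroutine yields control while it waits on I/O.
    results = await asyncio.gather(*(handle_document(i) for i in range(1000)))
    print(len(results), "documents processed")

asyncio.run(main())
```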
RabbitMQ and SQS queue brokers
We are skilled at using both RabbitMQ and SQS message brokers, as many of our operations happen within AWS. However, we prefer RabbitMQ when possible, as it offers more features and lets new services subscribe themselves to the queues they need. SQS, on the other hand, scales horizontally with ease and works perfectly within AWS infrastructures.
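To make the comparison concrete, here is what publishing the same task could look like with each broker, using the pika and boto3 clients; the queue names, region and message body are assumptions for the example:

```python
import boto3
import pika

TASK = '{"url": "https://example.com/post/1"}'

# --- RabbitMQ via pika ---
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
# Any service can declare the same durable queue and start consuming,
# which is how new consumers attach themselves to the pipeline.
channel.queue_declare(queue="text_tasks", durable=True)
channel.basic_publish(
    exchange="",
    routing_key="text_tasks",
    body=TASK,
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()

# --- Amazon SQS via boto3 ---
sqs = boto3.client("sqs", region_name="us-east-1")  # region is illustrative
queue_url = sqs.create_queue(QueueName="text-tasks")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody=TASK)
```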
Conclusions
With these trusted and reliable Big Data tools, we are able to design, deploy, configure and maintain any infrastructure. We have solid experience working on both short- and long-term projects of any complexity and stand ready to deliver top-notch services to our customers.