Kafka: have you heard this name somewhere? If so, are you familiar with it? If not, don't fret, you are in good company. Come, let's learn Kafka together.
Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time. Streaming data is data that is continuously generated by thousands of data sources, which typically send their data records simultaneously. A streaming platform needs to handle this constant influx of data and process it sequentially and incrementally.
In short, Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and lets you pass messages from one endpoint to another. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss.
Side note: Publish/subscribe messaging (also called the producer/consumer model) is a form of asynchronous service-to-service communication. In a pub/sub model, a message published to a topic is made available to every subscriber of that topic. Pub/sub messaging can be used in event-driven architectures or to decouple applications in order to increase performance.
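To make the side note concrete, here is a minimal in-memory sketch of the pub/sub idea. It is a toy, not Kafka's API: the `ToyPubSub` class, the `orders` topic name, and the two inboxes are made up for illustration, and delivery is synchronous here only to keep the example short.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ToyPubSub:
    """Hypothetical in-memory pub/sub bus (illustration only, not Kafka)."""

    def __init__(self) -> None:
        # topic name -> list of subscriber callbacks
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        # Every subscriber of the topic gets its own copy of the message.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = ToyPubSub()
billing_inbox: List[str] = []
audit_inbox: List[str] = []

# Two independent subscribers on the same topic.
bus.subscribe("orders", billing_inbox.append)
bus.subscribe("orders", audit_inbox.append)

delivered = bus.publish("orders", "order-1001 created")
```

The publisher never refers to the billing or audit code directly. It only knows the topic name, which is the decoupling the side note talks about.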
Enough with definitions. Now we will dive into the parts of Kafka one by one and explore further.
As applications evolve, they need many services to process data, e.g. databases, caching stores, distributed file storage, and business intelligence tools, along with different source systems that generate data at very high speed. Maintaining a connection with every target system is tedious and very expensive for the source system.
The source system has to support a different set of protocols, each one required by a particular target system, and the data formats required for sending the data also vary. On top of that comes the cost of maintaining a connection with every one of the target systems.
In fact, this is the classic producer-consumer problem. Think of all the source systems as producers and the target systems as consumers. The data produced by one producer is often requested by several consumers. Sending that data from the producer itself to all the target systems increases the load on the source systems.
Kafka sits between these producer and consumer systems, receiving all the data produced by the producers and allowing multiple consumers to consume that data from Kafka.
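A quick back-of-the-envelope calculation shows why the middle layer helps. The sketch below assumes the worst case, where every source system feeds every target system. The function names and the numbers 4 and 6 are made up for illustration.

```python
def point_to_point_links(sources: int, targets: int) -> int:
    """Integrations needed when every source talks to every target directly."""
    return sources * targets


def brokered_links(sources: int, targets: int) -> int:
    """Integrations needed when each system only talks to the broker."""
    return sources + targets


# Hypothetical landscape: 4 source systems and 6 target systems.
direct = point_to_point_links(4, 6)  # 4 * 6 = 24 integrations
via_broker = brokered_links(4, 6)    # 4 + 6 = 10 integrations
```

With direct connections the number of integrations grows with the product of the two counts, while with a broker in between it grows only with their sum.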
Kafka is built to process millions of messages per second, with high reliability and low latency.
Components of Kafka:
In event streaming, a topic is a particular stream of data (similar to a table in a relational database, but without any constraints). Topics classify the data so that records of the same type are published through the same topic.
- A topic is uniquely identified within a Kafka cluster by its name.
- We can have as many topics as we want.
- A topic is split into partitions.
As the name suggests, a partition is a part of a topic. The messages pushed to a topic are stored in its partitions.
The number of partitions for a topic is specified when the topic is created.
- A topic can have many partitions.
- Within each partition, each message gets a number (in incremental order) known as its offset.
- Records inside a partition are ordered by insertion.
- Offsets always increase, and a number that has already been used is never reused.
- Offsets have meaning only within a specific partition. Offset 1 of partition 2 is entirely different from offset 1 of partition 0.
- Order is guaranteed within each partition, but not across partitions.
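The partition and offset rules above can be sketched with a small in-memory model. This is a toy and not the Kafka client API: `ToyTopic` and `ToyPartition` are hypothetical names, and the caller picks the partition explicitly to keep things simple (in Kafka the producer usually decides, for example based on the record key). Offsets start at 0 here, as they do in Kafka.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyPartition:
    """Hypothetical append-only list; the list index plays the role of the offset."""
    records: List[str] = field(default_factory=list)

    def append(self, value: str) -> int:
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record


@dataclass
class ToyTopic:
    """Hypothetical topic with a partition count fixed at creation time."""
    name: str
    num_partitions: int
    partitions: List[ToyPartition] = field(init=False)

    def __post_init__(self) -> None:
        self.partitions = [ToyPartition() for _ in range(self.num_partitions)]

    def send(self, partition: int, value: str) -> Tuple[int, int]:
        offset = self.partitions[partition].append(value)
        return partition, offset

    def read(self, partition: int, offset: int) -> str:
        return self.partitions[partition].records[offset]


topic = ToyTopic(name="orders", num_partitions=3)
first = topic.send(0, "order-1 created")   # partition 0 gets its first offset
second = topic.send(2, "order-2 created")  # partition 2 starts its own numbering
third = topic.send(0, "order-1 paid")      # next offset in partition 0
```

Note how partition 0 and partition 2 both hand out the same first offset for completely different records. An offset only means something together with its partition.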
The data inserted into Kafka (inside a partition) is only available for a limited amount of time, the default being one week. Once data is written to the server, it can't be changed, i.e. the data is immutable.
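Here is a rough sketch of retention and immutability together, again as a toy model rather than Kafka itself. `ToyRecord` and `ToyRetentionLog` are hypothetical names, timestamps are plain seconds passed in by the caller, and expiry is checked record by record, which is a simplification (Kafka removes old data in whole log segments).

```python
from dataclasses import dataclass
from typing import List

SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60  # mirrors the one-week default


@dataclass(frozen=True)  # frozen: a record cannot be modified after creation
class ToyRecord:
    offset: int
    timestamp: float
    value: str


class ToyRetentionLog:
    """Hypothetical single-partition log with time-based retention."""

    def __init__(self, retention_seconds: float = SEVEN_DAYS_SECONDS) -> None:
        self.retention_seconds = retention_seconds
        self._records: List[ToyRecord] = []
        self._next_offset = 0

    def append(self, value: str, now: float) -> ToyRecord:
        record = ToyRecord(self._next_offset, now, value)
        self._records.append(record)
        self._next_offset += 1  # keeps counting even after old records expire
        return record

    def expire(self, now: float) -> int:
        """Drop records older than the retention window; return how many were dropped."""
        cutoff = now - self.retention_seconds
        kept = [r for r in self._records if r.timestamp >= cutoff]
        removed = len(self._records) - len(kept)
        self._records = kept
        return removed

    def offsets(self) -> List[int]:
        return [r.offset for r in self._records]


DAY = 24 * 60 * 60
log = ToyRetentionLog()
old = log.append("first", now=0.0)
log.append("second", now=6.0 * DAY)
removed = log.expire(now=8.0 * DAY)       # "first" is now older than 7 days
newest = log.append("third", now=8.0 * DAY)
```

After the sweep the oldest record is gone, but the next record still receives a fresh offset instead of recycling the expired one.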
We will explore more Kafka concepts in upcoming blogs.