The Topic, the Message and the Partition
Traditional messaging patterns, namely message queue and publish-subscribe, have some limitations as a result of their design.
In the previous post, Apache Kafka Ideas – Part 1, a couple of messaging use cases were introduced. To model those cases with Kafka, it is important to understand its core ideas. At the very heart of Kafka are topics and partitions. This post explains the basic concepts behind them.
A topic is an abstract entity that acts as a mailbox for messages. Each message in Kafka has to be sent to a topic. Kafka uses the following naming conventions:
- message senders are called Producers
- message recipients are called Consumers
- a single Kafka instance is called a Broker
- producers send data to a broker
- consumers poll for data from a broker
The image below illustrates Kafka’s message structure:
- message Key is optional
- message Value is required, it contains the actual payload
- message Timestamp is also optional and can be set by the sender; broker-side rules determine whether it is overridden
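The three fields above can be modelled in a few lines. This is only an illustrative sketch: the `Message` class and its field names are hypothetical and are not part of any Kafka client library, and the use of `bytes` for the key and value is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    """Illustrative model of a Kafka message (hypothetical, not a client class)."""

    value: bytes                        # required: the actual payload
    key: Optional[bytes] = None         # optional
    timestamp_ms: Optional[int] = None  # optional: may be set by the sender


# Only the value is mandatory; key and timestamp default to "not set".
minimal = Message(value=b"hello")
full = Message(value=b"hello", key=b"user-42", timestamp_ms=1_700_000_000_000)
```

Leaving `key` and `timestamp_ms` as `None` mirrors the fact that a sender may omit both.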
Partitions, unlike topics, have a physical representation in Kafka. Each message is persisted on disk within a partition:
- partitions are folders that aggregate messages on a disk
- partition name, or folder name if you like, consists of the topic name, a hyphen and the partition number (for example hire-0)
- partitions contain log files (*.log)
- log files are physical containers for messages
- each topic must have at least one partition
- partition numbering starts from 0
Physical representation of a topic named hire:
The additional *.index and *.timeindex files help Kafka look up messages stored on disk. /tmp/kafka-logs is the log root defined in the broker’s configuration. In the example above, the topic’s name is hire. The partitions (folders hire-0 and hire-1) are stored directly under the log root, so no separate folder is maintained for the topic itself.
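As a rough sketch of this layout, the snippet below derives the expected partition folders from the log root, the topic name and the number of partitions. The helper `partition_dirs` is hypothetical and only builds path strings; it does not touch the file system or talk to a broker.

```python
from pathlib import PurePosixPath
from typing import List


def partition_dirs(log_root: str, topic: str, partition_count: int) -> List[str]:
    """Return the folder each partition of `topic` is expected to live in."""
    root = PurePosixPath(log_root)
    # Folder name is <topic>-<partition number>; numbering starts from 0.
    # Note that there is no intermediate folder for the topic itself.
    return [str(root / f"{topic}-{n}") for n in range(partition_count)]


dirs = partition_dirs("/tmp/kafka-logs", "hire", 2)
print(dirs)  # ['/tmp/kafka-logs/hire-0', '/tmp/kafka-logs/hire-1']
```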
There are certain rules that define how messages are delivered to a particular partition; these will be covered later in the series. For now, let’s assume that a message goes to the partition given by the following equation, which we will call round robin selection:
partition = message_no % partition_count
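A minimal sketch of this simplified rule follows. The function name `select_partition` is made up for illustration, and the rule is the working assumption of this post rather than Kafka's full partitioning logic.

```python
def select_partition(message_no: int, partition_count: int) -> int:
    """Pick a partition using the simplified round robin rule from this post."""
    if partition_count < 1:
        raise ValueError("a topic must have at least one partition")
    return message_no % partition_count


# Six consecutive messages sent to a topic with two partitions.
assignments = [select_partition(n, 2) for n in range(6)]
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

With two partitions, consecutive messages simply alternate between partition 0 and partition 1.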
In Kafka, a partition is a structured commit log. It means that:
- messages are appended in the order they are received
- each message has its corresponding position in the log called offset
- offset and order of messages are maintained per partition
- offset numbering starts from 0
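The properties above can be illustrated with a small in-memory model. This is a toy sketch, not how Kafka stores data: `PartitionLog` and `Topic` are hypothetical classes, and messages are spread across partitions with the simplified round robin rule assumed earlier.

```python
from typing import List


class PartitionLog:
    """Toy append-only log: messages keep arrival order, offsets start from 0."""

    def __init__(self) -> None:
        self._messages: List[bytes] = []

    def append(self, value: bytes) -> int:
        offset = len(self._messages)  # next free position in this partition
        self._messages.append(value)
        return offset

    def read(self, offset: int) -> bytes:
        if offset < 0:
            raise IndexError("offsets start from 0")
        return self._messages[offset]


class Topic:
    """Toy topic: a name plus a list of independent partition logs."""

    def __init__(self, name: str, partition_count: int) -> None:
        self.name = name
        self.partitions = [PartitionLog() for _ in range(partition_count)]


topic = Topic("hire", 2)
# Send four messages, alternating between the two partitions.
offsets = [topic.partitions[n % 2].append(f"m{n}".encode()) for n in range(4)]
print(offsets)  # [0, 0, 1, 1]
```

The repeated offsets in the output show that offsets are tracked per partition: both partitions have a message at offset 0 and another at offset 1.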
An offset is never reset! It is a Java long, which means that 2^63 - 1 messages can be stored in one partition. If 1 million messages were sent per second, a producer would need about 292,471 years to fill up the whole partition.
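The estimate can be checked with a few lines of arithmetic. The sketch assumes 365-day years and a constant rate of 1 million messages per second; the constant names are chosen here for readability.

```python
MAX_OFFSET = 2**63 - 1                 # largest value of a 64-bit signed integer
MESSAGES_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes 365-day years

# Whole years needed to use up every offset at the assumed rate.
years = MAX_OFFSET // (MESSAGES_PER_SECOND * SECONDS_PER_YEAR)
print(years)  # 292471
```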
In the next post we will discuss consumer groups, which will finally allow us to define an alternative to the traditional message queue and publish-subscribe patterns.