Introduction to Kafka, the high-speed data bus
Big data systems are often composed of many subsystems, and data needs to flow between those subsystems continuously, with high throughput and low latency. Is there a single system that can serve both online applications (messaging) and offline applications (data files, logs)? This is the problem Kafka addresses. Kafka can serve two roles:
1. Reduce the complexity of system networking.
2. Reduce the complexity of programming. Subsystems no longer have to negotiate interfaces with one another; each subsystem plugs into Kafka like a plug into a socket, and Kafka takes on the role of a high-speed data bus.
Kafka is a messaging system open-sourced by LinkedIn in December 2010, primarily used to process active streaming data. Active streaming data is very common in website applications: it includes the site's page views (PV), what content users have visited, what they have searched for, and so on. This data is usually recorded in the form of logs and then processed statistically at regular intervals.
Traditional log analysis systems provide a scalable solution for processing log information offline, but they usually introduce a large delay if real-time processing is required. Existing message queue systems, on the other hand, handle real-time or near-real-time applications well, but unprocessed data is usually not written to disk, which can be a problem for offline applications such as Hadoop (which typically process data in batches, once an hour or once a day). Kafka is designed to solve exactly these problems, and it works well for both offline and online applications.
Producer: generates messages and data
Broker: caching proxy
Consumer: consumes messages and data
The architecture is very simple. Producers and consumers implement the interfaces that Kafka provides for registration. Data flows from the producer to the broker, and the broker acts as an intermediate cache and distributor: it distributes the data to the consumers registered with the system.
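The producer, broker and consumer roles described above can be illustrated with a toy in-memory model. This is only a sketch of the data flow, not Kafka's API: the `Broker` class and its `register`, `publish` and `cached` methods are hypothetical names chosen for illustration, and there is no networking or persistence.

```python
from collections import defaultdict


class Broker:
    """Toy broker: caches messages per topic and hands them to registered consumers."""

    def __init__(self):
        self._log = defaultdict(list)        # topic -> cached messages
        self._consumers = defaultdict(list)  # topic -> registered consumer callbacks

    def register(self, topic, callback):
        """A consumer registers interest in a topic."""
        self._consumers[topic].append(callback)

    def publish(self, topic, message):
        """A producer sends a message; the broker caches and distributes it."""
        self._log[topic].append(message)
        for callback in self._consumers[topic]:
            callback(message)

    def cached(self, topic):
        """Return a copy of the messages cached for a topic."""
        return list(self._log[topic])


if __name__ == "__main__":
    broker = Broker()
    broker.register("page_views", lambda m: print("consumer got:", m))
    broker.publish("page_views", "user 42 visited /index")
```

The producer and the consumer never talk to each other directly; each only knows the broker, which is what lets the broker act as a data bus between subsystems.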
1. Use the Linux filesystem cache directly to cache data efficiently.
2. Use Linux zero-copy to improve sending performance. Traditional data sending requires four context switches. With the sendfile system call, data is exchanged directly in kernel space, which reduces the number of context switches to two. According to test results, this can improve data sending performance by 60%. Technical details of zero-copy can be found at: https://www.ibm.com/developerworks/linux/library/j-zerocopy/
3. Data on disk is accessed at a cost of O(1).
a. Kafka manages messages by topic. Each topic contains multiple partitions, and each partition corresponds to a logical log composed of multiple segments.
b. Each segment stores multiple messages (see below). A message's id is determined by its logical position, so the storage location of a message can be derived directly from its id, which avoids maintaining an additional mapping from ids to locations.
c. Each partition has a corresponding in-memory index that records the offset of the first message in each segment.
d. Messages that publishers send to a topic are distributed evenly across its partitions (randomly, or according to a callback function specified by the user). When the broker receives a published message, it appends the message to the last segment of the corresponding partition. When the number of messages in a segment reaches the configured value, or the time since the messages were published exceeds a threshold, the messages in the segment are flushed to disk. Only messages that have been flushed to disk are visible to subscribers.
4. Explicitly distributed: there are multiple producers, brokers and consumers, and all of them are distributed. There is no load balancing mechanism between producers and brokers. Load balancing between brokers and consumers is done through ZooKeeper. All brokers and consumers register with ZooKeeper, and ZooKeeper keeps some metadata about them. If a broker or consumer changes, all other brokers and consumers are notified.
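The zero-copy idea in point 2 can be tried out from Python, whose standard `socket.sendfile` method uses the `os.sendfile` system call where it is available and falls back to ordinary `send` calls otherwise. The sketch below is not Kafka's own code; the names `send_file_zero_copy` and `demo` are hypothetical, and a local socket pair stands in for a real network connection.

```python
import os
import socket
import tempfile


def send_file_zero_copy(sock, path):
    """Send a whole file over a connected stream socket; return the bytes sent."""
    with open(path, "rb") as f:
        # On platforms with os.sendfile the kernel moves the file data to the
        # socket itself, so it is not copied through a user-space buffer.
        return sock.sendfile(f)


def demo():
    """Write a small file, send it through a socket pair, and read it back."""
    payload = b"kafka-log-segment-bytes" * 100
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_zero_copy(sender, path)
        sender.close()  # signals end-of-stream to the receiver
        received = b""
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received += chunk
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
    return sent, received == payload
```

The payload here is kept small so that it fits in the socket buffer and the demo can send and receive in a single thread; a real sender and receiver would run concurrently.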
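The storage scheme in point 3 (a segmented log plus an in-memory index of each segment's first offset) can be sketched as follows. This is a simplified illustration under stated assumptions: the `Partition` class, its method names and the `segment_capacity` parameter are hypothetical, segments are kept in memory rather than on disk, and offsets count messages, whereas a real log may address messages differently.

```python
import bisect


class Partition:
    """Toy partition: a logical log split into fixed-capacity segments."""

    def __init__(self, segment_capacity=3):
        self.segment_capacity = segment_capacity
        self.segments = [[]]     # each segment holds a list of messages
        self.base_offsets = [0]  # in-memory index: first offset of each segment
        self.next_offset = 0

    def append(self, message):
        """Append to the last segment, rolling to a new one when it is full."""
        if len(self.segments[-1]) >= self.segment_capacity:
            self.segments.append([])
            self.base_offsets.append(self.next_offset)
        self.segments[-1].append(message)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def locate(self, offset):
        """Map a message offset to (segment number, position inside segment)."""
        if not 0 <= offset < self.next_offset:
            raise IndexError(f"offset {offset} out of range")
        segment = bisect.bisect_right(self.base_offsets, offset) - 1
        return segment, offset - self.base_offsets[segment]

    def read(self, offset):
        """Fetch the message stored at the given offset."""
        segment, position = self.locate(offset)
        return self.segments[segment][position]
```

Because the offset itself encodes the position, no per-message id table is needed: the lookup searches only the small list of segment base offsets and then subtracts to find the position inside the segment.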
RocketMQ: an open-source message queue implemented by the Taobao team in China with reference to Kafka. It addresses some of Kafka's problems, such as message priority.
http://blog.chinaunix.net/uid-20196318-id-2420884.html
http://dongxicheng.org/search-engine/kafka/