Introduction to Kafka, the high-speed data bus
Big data systems are typically composed of many subsystems, and data must flow between them continuously with high throughput and low latency. Is there a single system that can serve both online applications (messaging) and offline applications (data files, logs)? This is where Kafka comes in. Kafka plays two roles:
1. It reduces the complexity of system networking.
2. It reduces programming complexity: subsystems no longer negotiate interfaces with each other directly. Instead, each subsystem connects to Kafka like a plug into a socket, with Kafka acting as the high-speed data bus.
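The decoupling described above can be illustrated with a toy in-memory publish/subscribe bus. This is only a conceptual sketch, not Kafka's actual API: the point is that producers and consumers know only the bus and a topic name, never each other, so N producers and M consumers need N + M connections instead of N × M point-to-point links.

```python
from collections import defaultdict, deque

class Bus:
    """Toy in-memory data bus (conceptual sketch, not Kafka's real API)."""

    def __init__(self):
        self.topics = defaultdict(deque)       # topic -> retained messages
        self.subscribers = defaultdict(list)   # topic -> consumer callbacks

    def subscribe(self, topic, callback):
        # A consumer registers interest in a topic, not in any producer.
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Messages are retained (loosely analogous to Kafka's on-disk log),
        # so an offline job could later read them as well.
        self.topics[topic].append(message)
        for callback in self.subscribers[topic]:
            callback(message)


bus = Bus()
seen = []
bus.subscribe("page-views", seen.append)        # e.g. a real-time consumer
bus.subscribe("page-views", lambda m: None)     # e.g. an offline batch job
bus.publish("page-views", {"user": "alice", "url": "/home"})
```

Neither side ever references the other: adding a third consumer of `page-views` requires no change to the producer, which is the "socket" property the analogy describes.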
Kafka is a messaging system open sourced by LinkedIn in December 2010, designed primarily to process active streaming data. Active streaming data is very common in web applications: page views (PV), the content users have visited, the queries they have searched, and so on. Such data is usually recorded as logs and then processed statistically at regular intervals.
Traditional log-analysis systems provide a scalable solution for processing log data offline, but they usually introduce large delays when real-time processing is required. Existing messaging (queue) systems, on the other hand, handle real-time and near-real-time applications well, but unconsumed data is usually not persisted to disk, which is a problem for offline applications such as Hadoop (which may process a batch of data only once an hour or once a day). Kafka was designed to solve exactly these problems, and it works well for both offline and online applications.