Kafka is the New Black
Kafka is not just rapidly replacing traditional messaging systems like MQSeries and streaming systems like Spark, Storm and Kinesis. It is also becoming the new data repository platform, especially for highly dynamic data sets in transactional systems, and is challenging traditional data warehouses for advanced analytics workloads, particularly Near Real Time (NRT) analytics use cases.
Kafka is proving its worth as a persistent data repository in financial services and other industries. Kafka clusters are processing billions of events per day and storing terabytes of data long term.
Near Real Time (NRT) analytics access to dynamic transactional data
This has been a boon for data scientists, who are now able to perform advanced analytics, Machine Learning (ML) and Artificial Intelligence (AI) operations on transactional data in NRT. Critical tactical and strategic business decisions can be driven by what is happening now, rather than extrapolated from transactions that happened days ago. These more immediate insights drive smarter, more targeted reactions and business decisions based on the most recent events.
Traditionally, business analysts and data scientists were given only limited access to the most current data in OLTP transactional systems, for reasons of data migration, performance, data security and privacy compliance. They had to wait until the data was copied to OLAP systems, organized into more structured data sets, and the required security controls (typically View-layer security) were tested and implemented. At best this meant waiting for daily batch ETL processes; at worst, time-consuming data obfuscation processes made various obfuscated versions of the exported data available days or weeks after the transactions or events occurred. Kafka is changing all that in a big way.
Refer to the Kafka Primer section later in this blog for a brief introduction to Kafka components.
Kafka was originally developed at LinkedIn as an internal infrastructure project. There were plenty of database options built to store data and ETL tools that could process data, but nothing to handle a continuous flow of data. This proved to have applications far beyond real-time processing for a social network. Today, retailers are redesigning fundamental business processes around continuous data streams while manufacturers are collecting and processing real-time data streams from internet-connected devices and vehicles. Financial services companies are also rethinking their fundamental processes and systems around Kafka.
LinkedIn open-sourced Kafka through the Apache Software Foundation. Several of the original creators of Kafka then founded Confluent to further develop, promote and commercially support Kafka as a platform. The founders quickly realized Kafka could be extended to meet a much broader range of functional requirements, needing only a few additional features or capabilities to broaden its appeal. This has paid off, as Kafka clusters are now being leveraged to manage and provide access to very large, persistent data sets.
Comparing Kafka to other alternatives
Think of Kafka as a kind of real-time version of Hadoop. Hadoop lets you store and periodically process file data at very large scale. Kafka lets you store and continuously process streams of data at scale. This stream processing is a superset of the batch-oriented processing Hadoop and its various processing layers provide. Hadoop and "Big Data" platforms targeted analytics applications, often in the data warehousing space. The low-latency nature of Kafka makes it applicable to the kind of core applications that directly power a business. Events in business are happening all the time. The ability to react to them as they occur, and to build services that directly power business operations and feed back into customer experiences in NRT, is changing the way banks and other organizations leverage their various data flows.
Kafka is designed to run as a fully scalable cluster of servers, providing excellent fault tolerance and high availability in an easy-to-manage platform for NRT data processing. Data persistence is fully configurable, and Publish/Subscribe processes are decoupled for asynchronous operation, whether they run in real time, NRT, batch or synchronous Request/Response mode.
What is missing is the comparable fine-grained Attribute-Based Access Controls (ABAC), accountability, data protection, data privacy, Data Loss Prevention (DLP) and User Behavior Analytics (UBA) available in varying degrees for traditional RDBMS platforms and other data repositories. This paradigm shift in how data is processed, shared and accessed has meant that traditional methods of providing even basic Role-Based Access Control (RBAC) no longer apply. Data security and privacy have been barriers to more rapid Kafka adoption whenever sensitive or regulated data is involved.
Kafka can be just as secure as traditional data repositories
In Kafka, traditional data-layer access controls and View-layer security controls are not possible. The equivalent access controls must be abstracted from both the processes Producing the data and the applications Consuming the data, and they must be platform agnostic. SecuPi provides the same centrally managed, policy-based, fine-grained access control, accountability, audit trail and data privacy compliance available on legacy applications and database systems to this new Kafka infrastructure. SecuPi applies data protection, access control and privacy compliance rules on the data the Producers are publishing, and a different set of rules on the data the Consumer applications subscribing to various Topics are consuming. This is accomplished without changing applications, source systems or Kafka infrastructure.
Making Kafka relevant for processing sensitive or regulated data
SecuPi enables organizations to quickly realize the benefits of Kafka for NRT processing of any number of interrelated or disparate data flows. This paradigm shift in how organizations are managing their data does not release them from all the same responsibilities for data security and privacy compliance. Continuing to provide all the same data privacy and regulatory compliance in a Kafka environment is simple with SecuPi working on both the Producer and Consumer side.
This is achieved by providing the same consistent data protection, fine-grained Attribute-Based Access Controls (ABAC), User Behavior Analytics (UBA), accountability, anonymization and audit trail of all access to sensitive or regulated data enjoyed today with traditional N-Tier architectures providing various controls at the Presentation, Application and/or Data layers.
Integration of SecuPi with Kafka
SecuPi and Confluent have worked together to ensure seamless integration of SecuPi’s data protection and privacy compliance capabilities with Kafka implementations to provide the same standard SecuPi functionality available for traditional RDBMS platforms and Applications On-Premise or in the Cloud. Relevant SecuPi core functionality includes:
- Selective data encryption (message or field level) using Format Preserving Encryption (FPE)
- Selective decryption of protected data based on various user or data attributes
- Detailed audit trail of all access to sensitive or regulated data
- User Behavior Analytics (UBA) on Consumer(s) access to data
- Data Lineage and Data Flow Mapping of Produced and Consumed data
- Data Masking and Obfuscation
Structured schemas such as JSON or XML messages enable SecuPi to provide fine-grained, field- or value-level access controls, encryption, masking or filtering based on various User or Data attributes.
Policies are defined within a central Policy Server and enforced on Produce and Consume processes in Kafka. SecuPi can also be used with the advanced Kafka Connect API and Kafka Streams Clients used for real-time processing of Messages.
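To make the produce-side enforcement concrete, here is a minimal sketch of applying field-level policy actions to a JSON message before it is published. The `POLICY` table, field names and actions are hypothetical illustrations, not SecuPi's actual rule format; in a real deployment the rules would come from the central Policy Server.

```python
import json

# Hypothetical, simplified policy table: field name -> action.
# A real deployment would pull these rules from a central policy server.
POLICY = {
    "ssn": "mask",
    "salary": "redact",
}

def apply_producer_policy(message: dict, policy: dict) -> dict:
    """Apply field-level actions to a JSON message before publishing."""
    protected = {}
    for field, value in message.items():
        action = policy.get(field)
        if action == "mask":
            # Keep only the last 4 characters visible.
            s = str(value)
            protected[field] = "*" * max(len(s) - 4, 0) + s[-4:]
        elif action == "redact":
            protected[field] = None
        else:
            protected[field] = value
    return protected

record = {"name": "Alice", "ssn": "123-45-6789", "salary": 90000}
print(json.dumps(apply_producer_policy(record, POLICY)))
```

The same pattern applies symmetrically on the Consume side, where a different policy table can govern what each subscriber is allowed to see.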
SecuPi is fully scalable along with Kafka
SecuPi’s integration with Kafka takes all of this into consideration. SecuPi is also fully scalable, running with each Producing and each Consuming process without introducing any performance bottleneck.
Generic Data Protection Use Cases
- Data protected on the Producers using various anonymization techniques (FPE encryption, masking, hashing). This protects the data before loading into cloud hosted data platforms and before it is available for data scientists and event processing on the Kafka platform.
- Data protected on the Consumers or Data Connectors (e.g. Snowflake Data Connector) – enabling encryption of specific data elements of customer data to enforce privacy compliance, Right To Be Forgotten (RTBF), Consent or Preference Management (Opt-In/Opt-Out).
- Decrypting data on Consumption if it was encrypted on the Producer side before publishing.
- Data protection when accessing Topics – applying dynamic masking, redaction, decryption or encryption when running KSQL or other complex event processing on the data. This enables data scientists to perform analytics on encrypted data, or on data in the clear without being granted permission to view it (e.g. calculate an average salary without seeing any individual salary).
- Providing fine-grained access controls on the Kafka platform (e.g. real-time monitoring of KSQL requests, filtering out VIP client data, geo-fencing, dynamic masking, and max-out limits that prevent excessive record-count access to client data).
- Protecting data when creating files (e.g. when a Kafka cluster is used to create Avro data files). SecuPi is used to encrypt, mask or redact sensitive data elements or fields within these files.
- Hold Your Own Key (HYOK) – decryption keys can be segregated from the Kafka cluster and provided only to SecuPi overlay on the KSQL Server.
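To illustrate the format-preserving idea behind the anonymization use cases above, here is a toy sketch of reversible, format-preserving scrambling of a digit string: encryption keeps the length and all-digit format, so the protected value still fits existing schemas, and the holder of the key can reverse it. This is an illustration only, not SecuPi's actual FPE (production FPE would follow a vetted construction such as NIST FF1); the key, tweak and function names are assumptions for the example.

```python
import hmac
import hashlib

KEY = b"demo-key"  # in practice, keys are held outside the Kafka cluster (HYOK)

def _digit_pad(key: bytes, tweak: str, n: int) -> list:
    """Derive n pseudorandom digits from the key and a per-field tweak."""
    digest = hmac.new(key, tweak.encode(), hashlib.sha256).digest()
    while len(digest) < n:
        digest += hmac.new(key, digest, hashlib.sha256).digest()
    return [b % 10 for b in digest[:n]]

def fpe_encrypt_digits(plain: str, key: bytes = KEY, tweak: str = "ssn") -> str:
    """Scramble a digit string; output is the same length and all digits."""
    pad = _digit_pad(key, tweak, len(plain))
    return "".join(str((int(c) + p) % 10) for c, p in zip(plain, pad))

def fpe_decrypt_digits(cipher: str, key: bytes = KEY, tweak: str = "ssn") -> str:
    """Reverse the scrambling with the same key and tweak."""
    pad = _digit_pad(key, tweak, len(cipher))
    return "".join(str((int(c) - p) % 10) for c, p in zip(cipher, pad))

token = fpe_encrypt_digits("123456789")
assert fpe_decrypt_digits(token) == "123456789"
```

Because the output preserves length and character class, protected values can flow through Topics, Connectors and downstream schemas unchanged, with decryption granted only to authorized Consumers.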
Kafka primer for the uninitiated
This section is included to provide a basic introduction to Kafka Components and Terminology. Skip if you are already a Kafka pro.
A unit of data in Kafka is a message, somewhat comparable to a row or record in a traditional database table. Messages can be grouped into batches for more efficient processing, much like batch loading in a traditional database. A single message or a batch of messages is published (produced) to the same Topic and Partition.
A Kafka message is just a collection of bytes. Some structure or schema is normally imposed on messages to facilitate more granular processing. JSON- or XML-formatted messages are common and contain key-value pairs. Apache Avro is also common and abstracts the schema from the message payload itself. This decoupling means that changes in message format by Producer processes do not necessarily require corresponding changes to Consumer processes. It also means Producers and Consumers can operate independently as real-time, NRT, batch or Request/Response processes.
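The schema decoupling can be sketched in a few lines: the payload carries only a schema id and positional values, and field names are re-attached at read time by looking the schema up in a registry. The in-memory dictionary below is a toy stand-in for a real schema registry (Confluent's Schema Registry plays this role for Avro); the schema ids and field names are illustrative assumptions.

```python
import json

# Toy in-memory stand-in for a schema registry.
SCHEMA_REGISTRY = {
    1: {"fields": ["account_id", "amount", "currency"]},
}

def encode(schema_id: int, values: list) -> bytes:
    """Serialize a message as (schema id, positional values) only."""
    return json.dumps({"schema_id": schema_id, "values": values}).encode()

def decode(payload: bytes) -> dict:
    """Re-attach field names by looking the schema up at read time."""
    msg = json.loads(payload)
    fields = SCHEMA_REGISTRY[msg["schema_id"]]["fields"]
    return dict(zip(fields, msg["values"]))

payload = encode(1, ["A-1001", 250.0, "USD"])
print(decode(payload))  # field names come from the registry, not the payload
```

Because the Producer publishes only a schema id, it can register a new schema version without redeploying every Consumer, which is the independence the paragraph above describes.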
Messages are categorized into Topics, somewhat analogous to tables in a database or folders in a file system. Each Topic is split into one or more Partitions, each an ordered subset of the Topic's messages. Partitions can be spread across different servers in the cluster, providing read and write performance through parallel processing, and each Partition can be replicated to other servers for redundancy.
A data Stream in Kafka most often refers to a Topic's stream of messages regardless of the number of Partitions. Finally, there are the Kafka Clients (Producers and Consumers) that publish and subscribe to Topics. There are also advanced Client APIs like the Kafka Connect API for real-time data integration and Kafka Streams for real-time stream processing. Consumers are normally assigned specific Partition(s), sometimes referred to as ownership of a Partition. Consumers in a Consumer Group work together to share the load and consume all messages on a Topic across its Partitions.
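The partition and consumer-group mechanics above can be sketched briefly: messages with the same key hash to the same partition (preserving per-key ordering), and partitions are spread across the consumers in a group. This is a simplified illustration, assuming a crc32 hash as a stand-in for the murmur2 hash Kafka's default partitioner uses, and a simple modulo spread in place of Kafka's actual rebalancing protocol.

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a partition by hashing the message key (crc32 as a stand-in)."""
    return zlib.crc32(key) % num_partitions

def assign(partitions: list, consumers: list) -> dict:
    """Spread partitions across a consumer group, round-robin style."""
    ownership = {c: [] for c in consumers}
    for p in partitions:
        ownership[consumers[p % len(consumers)]].append(p)
    return ownership

# Messages with the same key always land in the same partition.
assert partition_for(b"customer-42") == partition_for(b"customer-42")

# Two consumers share six partitions between them.
print(assign(list(range(NUM_PARTITIONS)), ["consumer-a", "consumer-b"]))
```

Within one group every message is processed once; a second group subscribing to the same Topic gets its own full copy of the stream, which is how Kafka decouples independent Consumers.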
Kafka Clusters can be distributed for site redundancy using MirrorMaker to replicate published data.
Contact SecuPi for more information on how to leverage Kafka for collecting, processing and storing sensitive or regulated data without compromising data security or privacy compliance.
The author, Les McMonagle (CISSP, CISA, ITIL), is Chief Security Strategist at SecuPi and has over 25 years of experience in information security, data privacy and regulatory compliance, helping some of the largest and most complex organizations select appropriate data security technology solutions.