Transparent Column Encryption for Hive and Spark
Today, a growing number of organizations store their customers’ sensitive information on big data platforms such as Hive and Spark. Companies deal with ever-growing volumes of data at varying sensitivity levels and therefore need ways to store that data both safely and efficiently. Roughly 8,000 enterprises worldwide use Hive and Spark, the vast majority in the computer software and information technology sectors, and this is only the beginning. These platforms have become crucial in the market because, with big data workloads, moving to the cloud makes sense for both computing-resource and time-saving reasons.
Although bringing customer information to big data platforms in the cloud seems like the right approach, companies may face issues both in implementing numerous security regulations, such as GDPR, and in protecting their customers’ sensitive data.
What are some common solutions today?
The common approach today is to apply encryption transparently at the HDFS level, which means that all users accessing the data through analytics applications and tools see it in the clear. This technique is problematic: it grants users access to sensitive data they do not need for their jobs, violating the “need-to-know” requirement.
Metaphorically speaking, this is like locking your home’s windows while leaving the front door open, with a direct path to your sensitive data. For our customers, this means thousands of open doors. Moreover, since GDPR at its core enforces access on a need-to-know basis through its “security by design”, consent, and “legal basis” requirements, encrypting data at the HDFS layer is far from enough.
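The gap between the two approaches can be illustrated with a small Python model. This is purely an illustration, not the actual HDFS or SecuPi API: the XOR “cipher”, key, and user names are toy stand-ins. The point is that HDFS-layer transparent encryption decrypts for every reader who can open the file, while column-level protection can decrypt selectively per entitlement.

```python
# Toy model contrasting HDFS-layer transparent encryption with
# column-level, entitlement-aware decryption. Illustration only.

from itertools import cycle

KEY = b"demo-key"  # stand-in for a managed encryption key


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy reversible 'encryption' -- never use XOR like this in production."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def hdfs_read(stored_ciphertext: bytes, user: str) -> bytes:
    # HDFS transparent encryption: decryption happens below the application
    # layer, so ANY user who can read the file gets plaintext back.
    return xor_cipher(stored_ciphertext, KEY)


AUTHORIZED = {"fraud_analyst"}  # hypothetical entitlement list


def column_read(stored_ciphertext: bytes, user: str) -> bytes:
    # Column-level approach: decrypt only for entitled users; everyone
    # else keeps seeing ciphertext.
    if user in AUTHORIZED:
        return xor_cipher(stored_ciphertext, KEY)
    return stored_ciphertext


ssn = xor_cipher(b"123-45-6789", KEY)
print(hdfs_read(ssn, "marketing_user"))    # plaintext for everyone
print(column_read(ssn, "fraud_analyst"))   # plaintext for the entitled user
print(column_read(ssn, "marketing_user"))  # still ciphertext
```

In the first model the file permissions are the only door, and it is wide open to every analytics user; in the second, the decision to decrypt is made per column and per user.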
Supposedly, the most effective solution would be to encrypt the data at the Hive or analytics-application layer, but this requires tedious configuration and maintenance of UDFs. That may be feasible for one specific column, yet it becomes extremely time-consuming when applied to thousands of personal-data columns, with new ones appearing every hour.
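To make the per-column toil concrete, here is a sketch of what the UDF-based approach implies: every sensitive column needs its own wiring (a UDF call in a view, plus grants and redeploys whenever a column is added). The table names, column names, and the `decrypt_col` UDF below are all hypothetical.

```python
# Sketch of the per-column maintenance burden of a UDF-based approach:
# each sensitive column must be enumerated and wrapped by hand-maintained
# DDL. Names and the decrypt_col UDF are hypothetical.

SENSITIVE = {
    "customers": ["ssn", "email", "phone"],
    "payments":  ["card_number"],
}


def secure_view_ddl(table: str, columns: list) -> str:
    """Emit HiveQL for a view that decrypts the listed columns via a UDF.
    (Simplified: non-sensitive columns of the table are omitted.)"""
    select = ", ".join(f"decrypt_col({c}) AS {c}" for c in columns)
    return f"CREATE OR REPLACE VIEW {table}_secure AS SELECT {select} FROM {table};"


for table, cols in SENSITIVE.items():
    print(secure_view_ddl(table, cols))
```

With a handful of columns this is manageable; with thousands of personal-data columns arriving continuously, the `SENSITIVE` inventory itself becomes the maintenance problem.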
What is the solution then?
The SecuPi application overlay replaces the need to apply and maintain UDFs and covers the entire big data lifecycle. The process can be split into three components:
- SecuPi for Ingestion: SecuPi automatically encrypts or masks data loaded through NiFi and other third-party ingestion tools
- SecuPi for Processing: the SecuPi overlay is configured on Hive and Spark, decrypting data for authorized users and usage while leaving it encrypted for unauthorized or suspicious analytics requests
- SecuPi for Consumption: the SecuPi overlay is configured on analytics applications to ensure data is decrypted only for legitimate, appropriate personal-data access
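The ingestion step above can be sketched in a few lines of Python. This is not SecuPi’s actual mechanism: for illustration it uses HMAC-based deterministic tokenization (a one-way mask) from the standard library, whereas a real deployment would use reversible encryption under managed keys. The key, column list, and record layout are assumptions.

```python
# Minimal sketch of ingestion-time column protection: configured columns
# are protected before records land in the data lake. Uses HMAC-based
# deterministic tokenization purely for illustration (one-way, not
# reversible encryption); key and column policy are hypothetical.

import hashlib
import hmac

SECRET = b"ingestion-demo-key"        # in practice, fetched from a key manager
PROTECTED_COLUMNS = {"ssn", "email"}  # in practice, driven by policy


def tokenize(value: str) -> str:
    """Deterministic token: equal inputs map to equal tokens, so joins and
    group-bys on the protected column still work on the masked data."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]


def protect_record(record: dict) -> dict:
    """Mask only the configured sensitive columns; pass the rest through."""
    return {k: tokenize(v) if k in PROTECTED_COLUMNS else v
            for k, v in record.items()}


row = {"name": "Ada", "ssn": "123-45-6789", "email": "ada@example.com"}
print(protect_record(row))  # name stays clear; ssn and email are tokenized
```

Because the tokenization is deterministic, downstream Hive or Spark jobs can still join and aggregate on the protected columns without ever seeing the original values.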
Companies deploy the SecuPi overlay on their business applications to protect their sensitive data, as the platform secures data from source to destination across all layers. The solution can be set up and implemented within days, since encryption keys are managed by the SecuPi overlays, and it requires little maintenance: automatic and supervised ingestion-encryption options exist for new data flows across all formats (e.g., JSON, CSV, ORC, Avro, Parquet). No manual UDFs are required, and performance overhead is negligible because encryption and decryption are performed at the overlay level with no API calls.
Last but not least, SecuPi provides functional coverage, protecting data across the entire analytics lifecycle with encryption, masking, real-time monitoring, anomaly detection, and access control, as well as platform coverage, extending your data protection from big data to the Teradata enterprise data warehouse and business applications.