With the recent release of Apache NiFi 1.10.0, it seems like a good time to discuss using Apache NiFi with data containing protected health information (PHI). The presence of PHI in your data raises significant concerns and, due to regulations such as HIPAA, imposes many requirements you may not otherwise face.
Apache NiFi probably needs little introduction, but in case you are new to it, Apache NiFi is a big-data ETL application that uses directed graphs called data flows to move and transform data. You can think of it as taking data from one place to another while, optionally, transforming the data along the way. The data travels through the flow in a construct known as a flow file. In this post we’ll consider a simple data flow that reads files from a remote SFTP server and uploads them to S3. We don’t need to look at a complex data flow to understand how PHI can impact our setup.
Encryption of Data at Rest and in Motion
Two core things to address when PHI is present are encryption of the data at rest and encryption of the data in motion. The first step is to identify the places where sensitive data will be at rest and in motion.
For encryption of data at rest, the first location is the remote SFTP server. In this example, let’s assume the remote SFTP server is not managed by us, has the appropriate safeguards, and is someone else’s responsibility. As the data goes through the NiFi flow, the next place the data is at rest is inside NiFi’s repositories, most notably the content repository and the provenance repository. (The provenance repository stores the history of all flow files that pass through the data flow.) NiFi then uploads the files to S3. AWS gives us the capability to encrypt S3 bucket contents by default, so we will use that and enforce it through an S3 bucket policy.
For encryption of data in motion, we have the connection between the SFTP server and NiFi and the connection between NiFi and S3. Since we are using an SFTP server, our communication with it will be encrypted. Similarly, we will access S3 over HTTPS, which provides encryption there as well.
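The two S3 requirements above (objects encrypted at rest, access only over HTTPS) can be sketched as a bucket policy. The snippet below is a minimal illustration, not a vetted compliance policy: the bucket name `example-phi-bucket` and the function name are hypothetical, and the two deny statements follow the commonly documented pattern of checking the `s3:x-amz-server-side-encryption` and `aws:SecureTransport` condition keys. One caveat worth verifying for your setup: a policy like this rejects uploads that omit the encryption header, so the NiFi S3 processor would need to request server-side encryption explicitly.

```python
import json


def build_phi_bucket_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies unencrypted uploads and plain-HTTP access.

    The bucket name is supplied by the caller; nothing here talks to AWS.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny any PutObject request that does not ask for
                # server-side encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            },
            {
                # Deny every request that does not arrive over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


if __name__ == "__main__":
    # "example-phi-bucket" is a placeholder name for illustration only.
    print(json.dumps(build_phi_bucket_policy("example-phi-bucket"), indent=2))
```

The printed JSON could then be attached to the bucket through the AWS console or CLI after review by whoever owns your AWS security configuration.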
If we are using a multi-node NiFi cluster, we may also have the communication between the NiFi nodes in the cluster. If the flows only execute on a single node you may argue that encryption between the nodes is not necessary. However, what happens in the future when the flow’s behavior is changed and now PHI data is being transmitted in plain text across a network? For that reason, it’s best to set up encryption between NiFi nodes from the start. This is covered in the NiFi System Administrator’s Guide.
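As a quick sanity check before PHI ever enters the cluster, something like the following could scan `nifi.properties` for the settings that secure node-to-node communication. This is a rough sketch under assumptions: it only checks that `nifi.cluster.protocol.is.secure` is `true` and that the keystore and truststore properties are filled in, it does not validate the certificates themselves, and the sample file contents and function names are made up for illustration. The System Administrator’s Guide remains the authority on the full set of required properties.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Properties that must be non-empty for TLS between nodes to work.
REQUIRED_NONEMPTY = (
    "nifi.security.keystore",
    "nifi.security.keystoreType",
    "nifi.security.truststore",
    "nifi.security.truststoreType",
)


def cluster_tls_problems(props: dict) -> list:
    """Return a list of human-readable problems; empty means the check passed."""
    problems = []
    if props.get("nifi.cluster.protocol.is.secure", "false").lower() != "true":
        problems.append("nifi.cluster.protocol.is.secure is not true")
    for key in REQUIRED_NONEMPTY:
        if not props.get(key):
            problems.append(f"{key} is empty")
    return problems


if __name__ == "__main__":
    # Hypothetical excerpt of a nifi.properties file with cluster TLS disabled.
    sample = """
    # cluster node properties
    nifi.cluster.is.node=true
    nifi.cluster.protocol.is.secure=false
    nifi.security.keystore=./conf/keystore.jks
    nifi.security.keystoreType=JKS
    nifi.security.truststore=
    nifi.security.truststoreType=
    """
    for problem in cluster_tls_problems(parse_properties(sample)):
        print(problem)
```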
Encrypting Apache NiFi’s Data at Rest
The best way to ensure encryption of data at rest is to use full disk encryption for the NiFi instances. (If you are on AWS and running NiFi on EC2 instances, use an encrypted EBS volume.) This ensures that all data persisted on the system will be encrypted no matter where the data appears. If a NiFi processor decides to have a bad day and dump error data to the log there is a risk of PHI data being included in the log. With full disk encryption we can be sure that even that data is encrypted as well.
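If NiFi runs on EC2, one way to confirm that every attached volume really is encrypted is to inspect the output of the EC2 `DescribeVolumes` API. The sketch below assumes the third-party `boto3` package and valid AWS credentials for the live call; the instance ID, region, and function names are placeholders. The checking logic itself works on the plain response dictionary, so it can be exercised without touching AWS.

```python
def unencrypted_volume_ids(describe_volumes_response: dict) -> list:
    """Return the IDs of volumes whose 'Encrypted' flag is not true."""
    return [
        volume["VolumeId"]
        for volume in describe_volumes_response.get("Volumes", [])
        if not volume.get("Encrypted", False)
    ]


def fetch_volumes_for_instance(instance_id: str, region: str) -> dict:
    """Call EC2 DescribeVolumes for one instance (requires boto3 and credentials).

    Pagination is ignored here, which is fine for the handful of volumes
    attached to a single instance.
    """
    import boto3  # third-party; imported lazily so the check above runs without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )


if __name__ == "__main__":
    # Hand-written sample shaped like a DescribeVolumes response.
    sample_response = {
        "Volumes": [
            {"VolumeId": "vol-aaaa", "Encrypted": True},
            {"VolumeId": "vol-bbbb", "Encrypted": False},
        ]
    }
    print(unencrypted_volume_ids(sample_response))
```

Against a real instance, the call would look like `unencrypted_volume_ids(fetch_volumes_for_instance(instance_id, region))` with your own values.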
Looking at Other Methods
Let’s recap the NiFi repositories:
- The FlowFile Repository contains metadata (flowfile attributes, state, and pointer to content) for all the current FlowFiles in the flow.
- The Content Repository holds the content for current and past FlowFiles.
- The Provenance Repository holds the history (powers the data lineage) of FlowFiles.
PHI could exist in any of these repositories when PHI is passing through a NiFi flow. NiFi does have an encrypted provenance repository implementation, and NiFi 1.10.0 introduces an experimental encrypted content repository, but there are some caveats. (Currently, NiFi does not have an implementation of an encrypted flowfile repository.)
Even when using these encryption implementations, spillage of PHI onto the file system through a log file or some other means is still a risk. There will also be a bit of overhead from the additional CPU instructions needed to perform the encryption. With an encrypted EBS volume, by contrast, we don’t have to worry about spilling unencrypted PHI to the disk, and per the AWS EBS encryption documentation, “You can expect the same IOPS performance on encrypted volumes as on unencrypted volumes, with a minimal effect on latency.”
There is also the NiFi EncryptContent processor, which can encrypt (and, despite the name, decrypt!) the content of flow files. This processor has its uses, but only in very specific cases. Trying to encrypt data at the level of the data flow for compliance reasons is not recommended because the unencrypted data may still exist elsewhere in the NiFi repositories.
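For readers who want to try the encrypted repositories anyway, the sketch below generates the relevant `nifi.properties` entries. Treat the property names and class names as assumptions taken from the 1.10-era administration guide as best I recall it (later releases reorganized these settings), and verify them against the documentation for your version. The key ID `Key1`, the choice of a 256-bit key, and the function name are illustrative only, and a static key stored in the properties file should itself be protected.

```python
import secrets

# Class names as recalled from the NiFi 1.10 admin guide; verify for your version.
PROVENANCE_IMPL = "org.apache.nifi.provenance.EncryptedWriteAheadProvenanceRepository"
CONTENT_IMPL = "org.apache.nifi.controller.repository.crypto.EncryptedFileSystemRepository"
KEY_PROVIDER = "org.apache.nifi.security.kms.StaticKeyProvider"

HEX_DIGITS = set("0123456789abcdefABCDEF")


def encrypted_repo_properties(key_id: str, key_hex: str) -> dict:
    """Build nifi.properties entries for the encrypted provenance and content repos."""
    # This sketch insists on a 256-bit key written as 64 hex characters.
    if len(key_hex) != 64 or not set(key_hex) <= HEX_DIGITS:
        raise ValueError("expected a 256-bit key as 64 hex characters")
    props = {
        "nifi.provenance.repository.implementation": PROVENANCE_IMPL,
        "nifi.content.repository.implementation": CONTENT_IMPL,
    }
    for repo in ("provenance", "content"):
        prefix = f"nifi.{repo}.repository.encryption"
        props[f"{prefix}.key.provider.implementation"] = KEY_PROVIDER
        props[f"{prefix}.key.id"] = key_id
        props[f"{prefix}.key"] = key_hex
    return props


if __name__ == "__main__":
    # Generate a random 256-bit key; "Key1" is a placeholder key ID.
    key = secrets.token_hex(32)
    for name, value in sorted(encrypted_repo_properties("Key1", key).items()):
        print(f"{name}={value}")
```

Note that there is deliberately no flowfile repository entry, since no encrypted implementation of it exists in this release.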
Removing PHI from Text in a NiFi Flow
What if you want to remove PHI (and PII) from the content of flow files as they go through a NiFi data flow? Check out our product Philter. It can find and remove many types of PHI and PII in natural language, unstructured text from within a NiFi flow. Text containing PHI is sent to Philter, and Philter responds with the same text but with the PHI and PII removed.
Conclusion
Full disk encryption, together with encrypting all connections in the NiFi flow and between NiFi nodes, provides encryption of data at rest and in motion. It’s also recommended that you check with your organization’s compliance officer before deployment to determine whether your organization or other relevant regulations impose any additional requirements. It’s best to gather that information up front to avoid rework in the future!