
Bulk export - Converting NDJSON to Parquet

By Lee Surprenant | Published May 10, 2022

Background

In IBM FHIR Server 4.4.0, we introduced experimental support for “export to parquet”. The feature was implemented by embedding a single-node Apache Spark cluster and using it to:

  1. infer a schema from a collection of JSON resources;
  2. write Parquet to Amazon S3 / IBM Cloud Object Storage.

I had planned to either split this into a separate component or use an external Spark service for this feature (or both!), but demand for the feature has not warranted the investment either approach would require. Thus, beginning with IBM FHIR Server 4.11.0, the “export to parquet” feature has been removed.

But fear not: the LinuxForHealth FHIR Server still supports exporting to newline-delimited JSON (NDJSON) on Amazon S3 / IBM Cloud Object Storage, and users with access to the bucket can use these same Spark features to convert from NDJSON to Parquet.

Bulk Export

Bulk export can be performed via HTTP GET or POST and the LinuxForHealth FHIR Server supports three flavors:

  • System export: [base]/$export
  • Patient export: [base]/Patient/$export
  • Group export: [base]/Group/[id]/$export

The export operations are defined at https://hl7.org/fhir/uv/bulkdata/export.html and usage information can be found in the LinuxForHealth FHIR Server Bulk Data Guide.

For example, to export all Patient and Condition resources from a LinuxForHealth FHIR Server at example.com:

curl --request POST \
  --url 'https://example.com/fhir-server/api/v4/$export' \
  --header 'Authorization: *****' \
  --header 'Content-Type: application/json' \
  --data '{
    "resourceType": "Parameters",
    "parameter": [
      {
        "name": "_type",
        "valueString": "Patient,Condition"
      }
    ]
  }'
By default, the LinuxForHealth FHIR Server uses a pseudo-folder structure for the output files of each job. In the example above, it might produce output files like the following in the configured bucket:

  • long-job-id/Condition_1.ndjson
  • long-job-id/Condition_2.ndjson
  • long-job-id/Condition_3.ndjson
  • long-job-id/Patient_1.ndjson

Normally, a client retrieves the exported NDJSON data from the download URLs listed in the $bulkdata-status response; the status URL itself is returned in the Location header of the $export response. Users could then copy those files to their own S3 / Cloud Object Storage bucket (or any other Hadoop-compatible storage) for analysis. Alternatively, privileged users with access to the export bucket can operate directly on the exported files.
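
For example, polling the status endpoint might look like the following (the job id is a placeholder); per the Bulk Data specification, the server responds with 202 Accepted while the job is still running and, once it completes, with a JSON body whose output array lists the download URLs:

curl --request GET \
  --url 'https://example.com/fhir-server/api/v4/$bulkdata-status?job=long-job-id' \
  --header 'Authorization: *****'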

Convert from NDJSON to Parquet via Apache Spark

Given a properly configured Spark environment, converting the exported NDJSON files to Parquet can be done in just a few lines of code.

For example, using pyspark to operate over data in IBM Cloud Object Storage in “us-east”. This assumes a notebook-style environment with a SparkContext (sc) and SparkSession (spark) already available; the bucket name my-bucket, the configuration name, and the object paths are placeholders, and cos_api_key is assumed to hold the service credentials created in the IBM Cloud console:

import ibmos2spark

# Build the COS credentials from the service-credentials JSON (cos_api_key)
credentials = {
    'service_id': cos_api_key['iam_serviceid_crn'],
    'api_key': cos_api_key['apikey'],
    'endpoint': 'https://s3.private.us-east.cloud-object-storage.appdomain.cloud',
    'iam_service_endpoint': 'https://iam.ng.bluemix.net/oidc/token'
}
cos = ibmos2spark.CloudObjectStorage(sc, credentials, 'cos_config', 'bluemix_cos')

# Read the exported NDJSON (Spark infers the schema) and write it as Parquet
conditions = spark.read.json(cos.url('long-job-id/Condition_*.ndjson', 'my-bucket'))
conditions.write.parquet(cos.url('parquet/Condition', 'my-bucket'))

The initial read may take some time because Spark must infer the schema from the data. That schema is saved in the Parquet output, however, so subsequent loads are very fast.
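
To illustrate (continuing the sketch above with its placeholder names), reloading the converted data picks up the saved schema with no inference pass:

# The schema is read from the Parquet file metadata, not inferred
conditions = spark.read.parquet(cos.url('parquet/Condition', 'my-bucket'))
conditions.printSchema()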

Spark automatically splits the data into a number of reasonably-sized Parquet files (one per partition), and it also provides options such as repartition, partitionBy, and bucketBy so that you can optimize the Parquet layout for your particular use.
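
One simple knob is the partition count. For example, continuing the sketch above (the count of 8 is arbitrary):

# Coalesce to fewer, larger output files before writing
conditions.coalesce(8).write.mode('overwrite').parquet(cos.url('parquet/Condition', 'my-bucket'))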

Working with FHIR data from Apache Spark

Now that you have the data in a format that works well with Spark, you can use Spark to shape / transform the data into whatever format is most useful for your project. For an example, check out the recording from our FHIR DevDays presentation FHIR from Jupyter or jump straight to the notebooks.
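
For a small taste, continuing the sketch above (the exact column paths depend on the schema Spark inferred from your data), you could flatten a few Condition fields into a tabular view:

from pyspark.sql.functions import col

# Nested FHIR elements are addressed with dotted column paths
conditions.select(col('id'), col('subject.reference'), col('recordedDate')).show()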