Handling JSON with jq

01 Sep 2019

Intro

jq is a power tool for working with JSON on the command line. It’s incredibly useful when working with JSON documents or API responses. It’s one of several popular tools for manipulating JSON, including JMESPath and Jsonnet. Although I have become particularly fond of jq, you might want to look into JMESPath for its integration into the Amazon Web Services (AWS) and Azure command-line tools.

If you don’t feel like installing jq on your machine, you can follow along on jq play.

Extracting Data

We can extract data from a JSON document like the one below. Let’s say this data is stored in data.json.

[
  {
    "name": "Ada",
    "reports": [
      "Boris",
      "Becky",
      "Booker"
    ]
  },
  {
    "name": "Booker",
    "reports": [
      "Carly",
      "Chris"
    ]
  },
  {
    "name": "Carly",
    "reports": [
      "Diana",
      "David"
    ]
  }
]

Let’s say we want all of the supervisors’ names:

% jq '.[].name' < data.json
"Ada"
"Booker"
"Carly"

If we don’t want the quotes, we can use --raw-output (or -r), a command-line flag that outputs raw strings rather than JSON-encoded values:

% jq -r '.[].name' < data.json
Ada
Booker
Carly

We can get all of the reports:

% jq -r '.[].reports[]' < data.json
Boris
Becky
Booker
Carly
Chris
Diana
David

We can get all of the employees together by combining the two filters:

% jq -r '.[].reports[], .[].name' < data.json
Boris
Becky
Booker
Carly
Chris
Diana
David
Ada
Booker
Carly

We can pipe data into jq from an API result in order to pretty-print it. This is also useful in scripts, as sketched after the example below.

% curl http://dummy.restapiexample.com/api/v1/employees | jq .
[
  {
    "id": "41393",
    "employee_name": "Fernando",
    "employee_salary": "4061",
    "employee_age": "68",
    "profile_image": ""
  },
  {
    "id": "41394",
    "employee_name": "SLKJDLKJ",
    "employee_salary": "32",
    "employee_age": "43",
    "profile_image": ""
  },
  ...
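
For example, here’s a small scripting sketch (assuming the same endpoint and the response shape shown above) that captures a field in a shell variable:

% name=$(curl -s http://dummy.restapiexample.com/api/v1/employees | jq -r '.[0].employee_name')
% echo "First employee: $name"
First employee: Fernando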

Transformations

jq has powerful transformation capabilities. Using the example above, we can build new objects from the data:

% jq '.[] | { firstName: .name }' < data.json
{
  "firstName": "Ada"
}
{
  "firstName": "Booker"
}
{
  "firstName": "Carly"
}

The map filter allows us to process array items in place. It works much like the map higher-order function found in many other languages.

% jq 'map({firstName: .name})' < data.json
[
  {
    "firstName": "Ada"
  },
  {
    "firstName": "Booker"
  },
  {
    "firstName": "Carly"
  }
]

We can filter the results by applying a predicate to each object of an array. We do this using the select filter.

% jq 'map(select(.name == "Ada"))' < data.json
[
  {
    "name": "Ada",
    "reports": [
      "Boris",
      "Becky",
      "Booker"
    ]
  }
]

And we can combine selectors with transformations:

% jq 'map(select(.name == "Ada") | { firstName: .name })' < data.json
[
  {
    "firstName": "Ada"
  }
]

Functions

jq comes with a bunch of built-in functions we can use to transform and filter data, such as reversing arrays:

% jq -r 'reverse[] | .name' < data.json
Carly
Booker
Ada

Filtering based on content using the startswith function:

% jq -r '.[] | .name | select(startswith("A"))' < data.json
Ada

String interpolation using the \(...) syntax:

% jq -r 'map("Hi, \(.name)")[]' < data.json
Hi, Ada
Hi, Booker
Hi, Carly

There are many more built-in functions worth investigating in the manual.
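
For instance, we can use length and add to count and total each supervisor’s reports in data.json:

% jq 'map(.reports | length)' < data.json
[
  3,
  2,
  2
]
% jq 'map(.reports | length) | add' < data.json
7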

Working with AWS Results

We can use jq to work with AWS API data. Assuming you have an AWS account, you can install awscli to work with the AWS APIs from the command line. One of the simplest things you can do is list publicly available AWS Elastic Compute Cloud (EC2) images, which you can use to create new virtual machines. This takes a long time to run, so we’ll save the results in a local file.

aws ec2 describe-images --output json > aws_images.json

The image data looks something like this:

{
  "Images": [
    {
      "Architecture": "i386",
      "CreationDate": "",
      "ImageId": "aki-00806369",
      "ImageLocation": "karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091001-test-04.manifest.xml",
      "ImageType": "kernel",
      "Public": true,
      "OwnerId": "099720109477",
      "State": "available",
      "BlockDeviceMappings": [],
      "Hypervisor": "xen",
      "RootDeviceType": "instance-store",
      "VirtualizationType": "paravirtual"
    },
    {
      "Architecture": "i386",
      "CreationDate": "",
      "ImageId": "aki-00896a69",
      "ImageLocation": "karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091002-test-04.manifest.xml",
      "ImageType": "kernel",
      "Public": true,
      "OwnerId": "099720109477",
      "State": "available",
      "BlockDeviceMappings": [],
      "Hypervisor": "xen",
      "RootDeviceType": "instance-store",
      "VirtualizationType": "paravirtual"
    },
    // ... more
  ]
}

I’ve cut out most of the results; when I ran the command, there were 123,521 public images. We can limit the results to the first 1000 images with an array slice and the |= update operator:

cat aws_images.json | jq '.Images |= .[0:1000]' > short_aws_images.json
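
If the |= operator is unfamiliar: it’s jq’s update assignment, which replaces a value with the result of running the right-hand filter against it. A minimal sketch on a toy document:

echo '{"xs": [1, 2, 3, 4]}' | jq '.xs |= .[0:2]'
{
  "xs": [
    1,
    2
  ]
}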

If we want a list of the providers of Ubuntu images, along with the versions of those images, we can take the .ImageLocation property of each image and parse it with a regular expression, capturing information about the Ubuntu release and version. The capture function lets us use named capture groups to transform the results into a new JSON object.
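
Before tackling the full expression, here’s a minimal sketch of capture with a single named group, applied to a toy string:

echo '"ubuntu-18.04"' | jq 'capture("(?<version>[\\d\\.]+)")'
{
  "version": "18.04"
}

Applied to the image data, the full command looks like this: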

cat short_aws_images.json \
    | jq '.Images[] | select(.ImageLocation | contains("ubuntu")) | .ImageLocation | {ImageLocation: .} + capture("^(?<provider>\\w+)\/.*ubuntu(?:-(?<release>trusty|xenial|bionic|eoan))?-(?<version>(?:[\\d\\.]+|daily))")'

The command is a little hard to read, but we can save it as a filter file that can be loaded via the --from-file (or -f) flag. We’ll save it as ubuntu.jq_filter with the contents below. One big advantage of storing the filter as a file is that we can use comments to clarify what we are trying to do.

.Images[] # unpack the array at the `Images` key
| select(.ImageLocation | contains("ubuntu")) # select all objects where the `ImageLocation` key contains ubuntu
| .ImageLocation # grab the `ImageLocation` key value from each object
| {ImageLocation: .} # Define a new object with the ImageLocation key and value
# then add that to the object returned by this capture, which will be an object
# where each key-value pair corresponds to a named capture group, like:
# { "provider": "abc", "release": trusty, "version": "12.0.0" }
  + capture("(?# we match the start of the string)^(?# then, we grab the provider, which is all the text before the first forward-slash)(?<provider>\\w+)\/(?# after the first backslash, we ignore everything until the last occurrence of ubuntu).*ubuntu(?# next, we may optionally have a release name )(?:-(?<release>trusty|xenial|bionic|eoan))?-(?# finally, we have a numeric version or the word daily)(?<version>(?:[\\d\\.]+|daily))")

Unfortunately, the regular expression is a bit hard to read. jq uses the Oniguruma regular expression library which seems to allow comments defined using a (?#...) group. Although this is better than having no comments, it would be even better to be able to use free-spacing to allow multi-line regular expressions with comments. I’m not sure whether the problem is that Oniguruma doesn’t support this or that there’s something lacking in jq’s implementation. Below, I show the regular expression using free-spacing with additional comments and escape characters removed.

^ # we match the start of the string
# then, we grab the provider, which is all the text before the first forward-slash
(?<provider>\w+)
/
# after the first forward slash, we ignore everything until the last occurrence of ubuntu
.*ubuntu
# next, we may optionally have a release name
(?:- # a dash is required
  (?<release>trusty|xenial|bionic|eoan))?
- # then a dash
# finally, we have a numeric version or the word daily
(?<version>
  (?:[\d\.]+ # the numeric version may contain digits and periods
  |daily)) # or it can be the word "daily"

Then, we can use the filter file like so:

cat short_aws_images.json \
  | jq -f ubuntu.jq_filter
{
  "ImageLocation": "099720109477/ubuntu/images/ebs/ubuntu-trusty-14.04-i386-server-20180627",
  "provider": "099720109477",
  "release": "trusty",
  "version": "14.04"
}
{
  "ImageLocation": "099720109477/ubuntu/images-testing/ebs-ssd/ubuntu-xenial-daily-amd64-server-20170823.1",
  "provider": "099720109477",
  "release": "xenial",
  "version": "daily"
}
{
  "ImageLocation": "099720109477/ubuntu/images-testing/hvm-ssd/ubuntu-xenial-daily-amd64-server-20181102",
  "provider": "099720109477",
  "release": "xenial",
  "version": "daily"
}
// more ...

Working with Elasticsearch Results

Elasticsearch is a distributed full-text search engine based on Apache Lucene. It has an easy-to-use API for issuing queries and retrieving data. This great article on useful Elasticsearch queries has some example data that we’ll use here. We’ll save it into a file called books.json.

{
  "hits": [
    {
      "_index": "bookdb_index",
      "_type": "book",
      "_id": "4",
      "_score": 1.3278645,
      "_source": {
        "title": "Solr in Action",
        "authors": [
          "trey grainger",
          "timothy potter"
        ],
        "summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
        "publish_date": "2014-04-05",
        "num_reviews": 23,
        "publisher": "manning"
      }
    },
    {
      "_index": "bookdb_index",
      "_type": "book",
      "_id": "1",
      "_score": 1.2871116,
      "_source": {
        "title": "Elasticsearch: The Definitive Guide",
        "authors": [
          "clinton gormley",
          "zachary tong"
        ],
        "summary": "A distibuted real-time search and analytics engine",
        "publish_date": "2015-02-07",
        "num_reviews": 20,
        "publisher": "oreilly"
      }
    }
  ]
}

Let’s try two things with the data. First, we want to transform the original result into a JSON array containing the authors of all books. We start by retrieving the array referenced by the hits key. Then, we replace each “hit” in the array with the value of its _source.authors key. Next, we flatten the resulting multi-dimensional array. Finally, we sort the array and return only unique results; this isn’t strictly necessary, but it typically makes the output cleaner.

cat books.json | jq '.hits | map(._source.authors) | flatten | sort | unique'
[
  "clinton gormley",
  "timothy potter",
  "trey grainger",
  "zachary tong"
]

Let’s try obtaining the total number of reviews of all of the books.

cat books.json | jq '.hits | map(._source.num_reviews) | add'

First, we get the array referenced by the hits key. Then we map over that array, extracting the number of reviews from each hit. Finally, we add those values together. The answer is 43.
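
As a small variation, dividing add by length gives the average number of reviews per book:

cat books.json | jq '.hits | map(._source.num_reviews) | add / length'
21.5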

Conclusion

jq is an incredibly useful program that makes working with JSON data easy. It’s the missing tool for manipulating JSON from the command line, and it’s great in scripts, especially when dealing with APIs that return JSON. jq is an incredibly rich tool with its own language and idioms. Not only can you filter and transform data and save filters in files, but there are also lots of things I didn’t cover, including working with streaming JSON, defining functions, and creating your own modules of custom filters. While it’s not the only JSON tool you should look at, it will go a long way toward filling that empty spot in your toolbox.
