DB, ElasticSearch, no-SQL

ElasticSearch: Performing aggregations and sub aggregations on filtered children of filtered parents with a single query


Edit: Post refers to ES 1.x. Queries with 2.x and 5.x are pretty much the same, only with couple of changes in the filtered/query structure which were deprecated. Check out the breaking changes in the official website.

On a project I’m currently working on, I wrote a propriety ODM for ES, one of the function this ODM had to support was aggregations and sub aggregations on filtered children of filtered parents. The problem was that ES docs provided examples for each separate query (aggregation, sub aggregation, has_child, has_parent), but it didn’t really explained properly combine all of them together. So the situation was that I had a clear idea of the API the ODM will expose, but I wasn’t sure about the request tree it should generate to pass to ES. So after some reasearch, I thought I’d share my way to do it.

Query definition

Lets assume that we have an index of book_store which have 2 types, agency and books. Our requirement is to get all books which were published, but not yet sold out, belonging to an agency named “Marvel” with whom we don’t have any legal issues, and get total sales, per book, per month.

This is the full query, I’ve simplified it in terms of conditions and the aggregations amount in order to make the request tree clearer. We’ll go over each part of it soon enough.

GET book_store/book/_search
{
  "query": {
    "filtered": {
      "query": {
        "filtered": {
          "filter": {
            "bool": {
              "must": [
                {
                  "term": {
                    "published": true
                  }
                }
              ],
              "must_not": [
                {
                  "term": {
                    "sold_out": true
                  }
                }
              ]
            }
          }
        }
      },
      "filter": {
        "has_parent": {
          "type": "agency",
          "query": {
            "filtered": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "term": {
                        "name": "Marvel"
                      }
                    }
                  ],
                  "must_not": [
                    {
                      "exists": {
                        "field": "legal_issue_id"
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  },
  "aggs": {
    "book_name": {
      "terms": {
        "field": "book_name"
      },
      "aggs": {
        "publish_months": {
          "date_histogram": {
            "field": "publish_date",
            "interval": "month"
          },
          "aggs": {
            "sales": {
              "value_count": {
                "field": "books_sold"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

Step by step

As you can see, once you figure out the correct structure the query is pretty straightforward. At first, we a set up a query on the type we want to work with (lines 2-26), we’re doing multiple filters query by wrapping them in a boolean query to get only the books which are published but not yet sold out.

Next, we are filtering the matches in the query by has_parent. So make sure you’ve defined child-parent relationship between your types. Between lines 27-53 we define that our matches must have an agency parent, and this agency parent also filtered by boolean query, which states that the agency must be “Marvel” and we must not have any legal issue with it.

Aggregating on the filtered data

Up until line 56 we defined the desired result filters and matches that on which we will be performing our aggregations. Now let’s examine the desired aggregations: As you probably know ES allows us to nest aggregations, this way we can perform aggregation on already aggregated data.

  1. Lines 57-59 – Will perform aggregation by book name.
  2. Lines 62-64 – Will perform date histogram aggregation on each book name, the date aggregation buckets will be listed under each book name aggregation.
  3. Lines 68-70 – Will take every date range in the buckets created by the date histogram aggregation, and under each date list the count of the sold books.

So basically now you have the aggregation of sold books, per date range, per book. That’s all.

P.S, you may have noticed that I’ve written that this is my way of executing this query. While i come up with this query after some strenuous research with ES docs, I don’t rule out that it maybe somehow optimized a little bit. So if anyone have any better idea it sure is welcome.

One thought on “ElasticSearch: Performing aggregations and sub aggregations on filtered children of filtered parents with a single query

  1. I spent hours looking looking for *a* way to do it, not to mention the most correct way. Thanks for sharing

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s