Describing the Policy Conditions

Now that we know how to retrieve the data, the next step is to write the conditions that the data must satisfy. This is done in the policy definition. There can be only one such definition in a given policy template.

A policy definition consists of a list of validations. Each validation may in turn describe multiple checks. A validation also defines one or more Escalations that trigger when a check fails and Resolutions that trigger when an underlying issue is fixed. For further details, see Triggering Actions. Finally, a validation provides default summary and detail text templates (see Incident Message and Email Templates) that are used to render the incident message.

Each validate or validate_each runs independently and generates either 0 or 1 incidents.

The syntax for a policy definition is:

policy <string literal> do
  validate[_each] $<datasource>|@<resources> do
    summary_template <string literal>
    detail_template <string literal>
    hash_include <field_name>, <field_name>, ...
    hash_exclude <field_name>, <field_name>, ...
    escalate $<escalation>
    resolve $<resolution>
    ...
    check <term>
    ...
    export <path expression> do
      resource_level <boolean>
      field <name> do
        label <string literal>
        format <string literal>
        path <path expression>
      end
      field ...
    end
  end
  validate...
end

Where:

• Each validation starts with validate or validate_each.
  • validate applies the checks to the given datasource or resources as a whole.
  • validate_each iterates over the given datasource or resources and applies the checks to each element. If the value is not an array, it is wrapped in a single-element array.
• summary_template provides a text template that is applied to the escalation data to render the incident message summary.
• detail_template provides a text template that is applied to the escalation data to render the incident message details. It is displayed above the export table, if one is specified.
• hash_include is an array of fields in the escalation data that are checked to determine whether the data has changed and actions should therefore be re-run. By default, all fields are checked, so if any value changes, all actions run again, including emails and cloud workflows. In general, this field does not need to be specified. The main exception is data that contains a constantly changing value, such as a timestamp.
• hash_exclude is an array of fields in the escalation data to exclude when determining whether the data has changed. This field is mutually exclusive with hash_include.
• escalate indicates Escalations to trigger when a check fails.
• resolve indicates Resolutions to trigger when all existing violations are resolved.
• check identifies a term that must return anything other than false, 0, an empty string, an empty array, or an empty object. If the term returns one of these values, the check fails, an incident is created, and any associated escalation triggers.
• export controls whether a table of resources is exported for the incident.
  • path expression is a string literal corresponding to a jmes_path expression acting upon the violation data. It can be used to extract a table of resources if the resources exist as a subpath in the data. This field is optional.
  • resource_level is a boolean stating whether the exported data is resource-level data. If it is, available actions can be run on a selected group of resources or on all of them.
  • field specifies a field in the data, such as id. Each field corresponds to a column in the data table. Field values should be simple types such as integers, strings, booleans, or arrays of simple values.
    • name is the object field key/name in the violation data row.
    • label is a human-readable label associated with the name. It is shown as the column header. If omitted, name is used.
    • format controls formatting for the column. Currently the left, center, and right keywords are supported. By default, columns are left-aligned.
    • path is a string literal corresponding to a jmes_path expression acting upon each resource. It can be used to extract a field from an embedded data structure or to rename a field. By default, name is used.
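To make the roles of name, label, and path concrete, the following Python sketch models how rows of violation data might be turned into table columns. It is an illustration only, not the policy engine's implementation: the helpers resolve_path and build_table and the sample row are hypothetical, and only simple dotted paths are handled rather than full jmes_path expressions.

```python
def resolve_path(row, path):
    """Follow a simple dotted path such as "account.name" through nested dicts."""
    value = row
    for key in path.split("."):
        value = value[key]
    return value


def build_table(rows, fields):
    """Return (headers, table_rows) for a list of field definitions.

    Each field is a dict with a required "name" and optional "label" and
    "path". Both label and path fall back to name when omitted.
    """
    headers = [f.get("label", f["name"]) for f in fields]
    table = [
        [resolve_path(row, f.get("path", f["name"])) for f in fields]
        for row in rows
    ]
    return headers, table


# Hypothetical sample data, for illustration only.
rows = [{"account": {"id": 1, "name": "my account"}, "region": "us-west-1"}]
fields = [
    {"name": "account_name", "label": "Account Name", "path": "account.name"},
    {"name": "region"},  # no label or path: both default to "region"
]

headers, table = build_table(rows, fields)
print(headers)  # ['Account Name', 'region']
print(table)    # [['my account', 'us-west-1']]
```

In this model, path is what lets a column pull its value from a nested structure while name still provides the default header.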

The policy engine runs each check in order and stops when a check fails. A check fails if the corresponding term returns false, 0, an empty string, an empty array, or an empty object. With validate_each, the policy engine applies this algorithm to each element of the datasource or resources.

Each time a check fails, the corresponding data is added to the violation data. With validate, this can happen only once, so the escalation data ends up being the validated datasource or resources. With validate_each, only the elements that fail a check are added to the escalation data.
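The evaluation rules above can be modeled in a few lines of Python. This is a sketch of the described behavior under stated assumptions, not the engine's actual code: the function names are hypothetical, checks are represented as plain Python callables, and a return value of None stands for "no incident".

```python
def check_fails(value):
    """True for the failing values: false, 0, "", [] and {}."""
    if isinstance(value, bool):
        return value is False
    if isinstance(value, (int, float)):
        return value == 0
    if isinstance(value, (str, list, dict)):
        return len(value) == 0
    return False  # assumption: any other value counts as passing


def first_failure(checks, data):
    """Run checks in order and stop at the first one that fails."""
    for index, check in enumerate(checks):
        if check_fails(check(data)):
            return index
    return None


def validate(checks, data):
    """Checks see the data as a whole; on failure all of it is violation data."""
    return data if first_failure(checks, data) is not None else None


def validate_each(checks, data):
    """Checks see one element at a time; only failing elements are kept."""
    items = data if isinstance(data, list) else [data]
    violations = [i for i in items if first_failure(checks, i) is not None]
    return violations or None


# Hypothetical sample data, for illustration only.
reservations = [{"id": "a", "time_left": 100}, {"id": "b", "time_left": 900000}]
three_days = 3 * 24 * 3600

print(validate_each([lambda item: item["time_left"] > three_days], reservations))
# [{'id': 'a', 'time_left': 100}]
print(validate([lambda data: len(data) < 5], reservations))
# None
```

Either function yields at most one result per validation, which matches the rule that each validate or validate_each produces 0 or 1 incidents.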

The violation data is exported as a table in the incident view page.

Example:

Assume that the $reservations datasource has data like:

[
  {
    "account": {
      "id": 1,
      "name": "my account"
    },
    "region": "us-west-1",
    "instance_type": "m1.small",
    "instance_count": 10,
    "end_time": "2020-01-01 01:02:03",
    "time_left": 10200
  },
  ...
]

The following policy raises an incident for reservations that expire in less than 3 days:

policy "ri_expiration" do
  validate_each $reservations do
    summary_template "Reserved instances are nearing expiration."
    detail_template <<-EOS
      Found {{ len data }} expired reservations in account id {{ rs_project_id }}
    EOS
    export do
      field "account_name" do
        label "Account Name"
        path "account.name"
      end
      field "account_id" do
        label "Account ID"
        path "account.id"
      end
      field "region" do
        label "Region"
      end
      field "instance_type" do
        label "Instance Type"
      end
      field "instance_count" do
        label "Instance Count"
      end
      field "end_time" do
        label "End Time"
      end
      field "time_left" do
        label "Time Left In Seconds"
        format "right"
      end
    end
    hash_include "id", "end_time"
    escalate $alert
    # Passes only while more than 3 days remain before end_time.
    check gt(dec(to_d(val(item, "end_time")), now), 3*24*3600)
  end
end

In the example above, the policy defines a single validation with a single check. The check returns a boolean value that is false when the time between a reserved instance's expiration date and now is less than 3 days. In that case, the alert escalation triggers. The violation data consists of an array that contains all the reservations that expire in less than 3 days.
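The arithmetic behind that check can be illustrated outside the policy language. The Python sketch below mirrors the comparison using the end_time format from the sample data. The function name passes_check and the fixed value of now are assumptions made for the illustration, since the policy engine supplies the current time itself.

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(days=3)  # same duration as 3*24*3600 seconds


def passes_check(end_time, now):
    """Return True while more than 3 days remain before end_time."""
    end = datetime.strptime(end_time, "%Y-%m-%d %H:%M:%S")
    return (end - now) > THRESHOLD


# A fixed "now" keeps the example reproducible.
now = datetime(2019, 12, 30, 0, 0, 0)

print(passes_check("2020-01-01 01:02:03", now))  # False: about 2 days remain
print(passes_check("2020-01-10 00:00:00", now))  # True: 11 days remain
```

A False result corresponds to a failed check, so the first reservation would be added to the violation data and the second would not.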

The export block defines a table of information that is displayed in the email as well as on the incident show page in the dashboard.

Triggering Actions

Actions run whenever the underlying violation data changes. By default, all fields are used to determine whether the data has changed. In the example above, the time_left field changes continually, which would cause actions such as email to retrigger. hash_include and hash_exclude modify this behavior by limiting which fields enter the calculation. By supplying id and end_time to hash_include, we ensure that new alerts are sent only when one of these two values changes. We could have achieved the same result with hash_exclude "time_left" instead, since all other fields in the datasource are relatively stable.
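One way to picture this change detection is as a fingerprint computed over the selected fields of each row. The Python sketch below is a model under stated assumptions: the use of SHA-256 over sorted JSON and the names select_fields and fingerprint are illustrative choices, not the engine's documented mechanism.

```python
import hashlib
import json


def select_fields(row, include=None, exclude=None):
    """Keep only the top-level fields that take part in change detection."""
    if include is not None and exclude is not None:
        raise ValueError("hash_include and hash_exclude are mutually exclusive")
    if include is not None:
        return {k: v for k, v in row.items() if k in include}
    if exclude is not None:
        return {k: v for k, v in row.items() if k not in exclude}
    return dict(row)  # default: every field counts


def fingerprint(rows, include=None, exclude=None):
    """Hash the selected fields; a different digest means the data changed."""
    selected = [select_fields(r, include, exclude) for r in rows]
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two consecutive evaluations in which only time_left differs (sample data).
before = [{"id": "ri-1", "end_time": "2020-01-01 01:02:03", "time_left": 10200}]
after = [{"id": "ri-1", "end_time": "2020-01-01 01:02:03", "time_left": 10140}]

keys = ["id", "end_time"]
print(fingerprint(before) == fingerprint(after))  # False
print(fingerprint(before, include=keys) == fingerprint(after, include=keys))  # True
print(fingerprint(before, exclude=["time_left"]) == fingerprint(after, exclude=["time_left"]))  # True
```

With the default of hashing every field, the changing time_left value alone is enough to produce a new fingerprint and retrigger actions. Restricting the fields in either direction keeps the fingerprint stable.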

Only top-level fields will be considered for hash_include and hash_exclude. If you have a nested structure such as:

[
  {
    "id": "abc",
    "config": {
      "foo": "bar",
      "baz": "biz"
    }
  }
]

Then you may specify hash_include "config" to detect any change in config, but you cannot target individual fields within config itself. If you wish to hash only a specific field within config, such as config.foo, use a JavaScript-based script block to transform the nested fields into top-level fields first.
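The shape of that transformation is sketched below in Python, purely to show what the data looks like before and after. In a real policy template the work would be done in a JavaScript script block as noted above. The helper promote and the new field name config_foo are hypothetical.

```python
def promote(rows, nested_key, field, new_name):
    """Copy rows[i][nested_key][field] to a new top-level field on each row."""
    return [{**row, new_name: row[nested_key][field]} for row in rows]


rows = [{"id": "abc", "config": {"foo": "bar", "baz": "biz"}}]
flat = promote(rows, "config", "foo", "config_foo")

print(flat[0]["config_foo"])  # bar
print(sorted(flat[0].keys()))  # ['config', 'config_foo', 'id']
```

Once the value exists as a top-level field, a validation could name it directly, for example with hash_include "id", "config_foo".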