Analytical Queries on Graph Data and Neural Networks with KQL in Microsoft Fabric and Azure Data Explorer (ADX)

By Lukas Hillesheim, 25 June, 2024
Blog Article & Copyright by Lukas Hillesheim
KQL is a relatively new BI language from Microsoft and is used to retrieve data in the Azure cloud (Azure Data Explorer (ADX) and MS Fabric). This blog article shows how to call the graph engine through KQL and explains the make-graph and graph-match operators. The article is centered on modelling and query tasks that arise in connection with a 'Bill of Material' (BOM).
Data for my examples and graphics: the data for this article was generated by my Python app ndim-graph and uploaded to my database ClassroomDB on ADX. For presentation purposes the data was queried with KQL embedded in ndim-graph and visualized with the Python libraries networkx and pyvis.

Content:

1. Features of the KQL language

Feature | Description
Data wrangling | Features similar to SQL: selection, filtering, join, grouping, aggregation etc.
Flexible data types | KQL data types can handle structured, unstructured and complex data like nested arrays, hash tables and JSON
Integrated visualization | With the render operator, data can be visualized ad hoc (visuals similar to Power BI)
Integrated graph engine | Run queries on nodes and edges
AI libraries | Neural networks, time series, anomaly detection, root cause analysis etc.
Python, R | Integration of Python and R scripts
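As a first impression, here is a minimal sketch that combines wrangling, aggregation and the render operator, run against the table graph_bom introduced later in this article (the aggregation itself is hypothetical and assumes Cost is numeric):

// total cost per status, visualized ad hoc as a column chart
graph_bom
| where Type == 'Component'
| summarize total_cost = sum(Cost) by Status
| render columnchart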

2. Graph theory and neural network

2.1. Bill of Material (BOM) as tree-organized data

A single Bill of Material (BOM) is usually modelled as a tree with a parent-child relationship. The following graphic exposes the hierarchical nature of the entities product and component within the tree:
  • A single product consists of components
  • A single component consists of other components
  • etc.
Leaf nodes can reside on different levels of a tree.
As a consequence of the internal structure of 'PROD-A', for example, the leaf nodes 'COMP-E', 'COMP-B', 'COMP-A' and 'COMP-D' are located on different levels of the tree.
Item levels:
  • 'PROD-A' is located on level 0
  • 'COMP-E' is located on level 1
  • 'COMP-B' is located on level 3 etc.
A tree whose leaf nodes reside on different levels is called unbalanced.
Composition of 'PROD-A':
Fig.1: BOM as tree or graph
A database table can represent the hierarchical structure of a BOM with a parent-child relationship. The two rows show us that 'COMP-A' has a parent component named 'COMP-H':
Id | ... | Name | ParentId
4 | ... | COMP-A | 11
11 | ... | COMP-H | 12
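For illustration, the same two rows as a hypothetical KQL datatable (ParentId is typed as string here because, as shown later, it may hold a comma-separated list):

datatable(Id:int, Name:string, ParentId:string) [
    4, 'COMP-A', '11',
    11, 'COMP-H', '12'
]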
A tree can be considered a special case of a graph.

2.2. Multiple BOM and additional entities form a graph or neural network

Due to customer requests it may be necessary to show the relationship between components and multiple BOMs. Additionally, other entities like customer may be part of the model. In this case the tree model is no longer sufficient and it's time for a graph or neural network.
The following picture shows the relationships between nodes of different entity types (customer, product and component):
  • A customer retrieves multiple products
  • A product consists of multiple components
  • Components consist of other components etc.
Fig.2: Graph or neural network
If you design the data model for a graph, there is something to consider. E.g. 'COMP-A' has a relationship to its parent component 'COMP-H' as well as to another parent 'PROD-B'. With respect to the multiple relationships of 'COMP-A', the table stores two instances of 'COMP-A', each with a different ParentId:
Id | ... | Name | ParentId
2 | ... | PROD-B | NULL
4 | ... | COMP-A | 2
4 | ... | COMP-A | 11
11 | ... | COMP-H | 12

3. Providing graph data

In order to process graph data, two principles should be followed:
  • Provide two entities node and edge
  • The processing of data organized in a graph or neural network is best done by a graph engine

3.1. Paradigm shift through Cloud data center

Migration from an on-premise data center to a cloud-based data center that is designated for all kinds of data and all kinds of workloads (data lake) causes a paradigm shift:
A BI stack traditionally consists of multiple source systems (OLTP), a data warehouse, multiple data marts and multiple cubes. There are two main reasons that drive the separation of processing steps onto different systems:
  • Better performance due to dedicated hardware, dedicated configuration and a dedicated model
  • Each of the dedicated systems exposes an additional semantic layer. E.g. a data warehouse shows the history of data changes. This allows a representation of the state of the data for every single point in time. A data mart shows data categorized as facts and dimensions. This makes it easy for users to retrieve and understand report data.
Each cloud data architect must find the best strategy: whether to work directly on the data lake as mentioned above, or to follow a hybrid pattern where some workloads access dedicated systems (as in the on-premise world).

3.2. KQL Walk-Through

In my scenario I show the in-place approach: graph queries run without a dedicated separate repository for graph data. Graph data is directly extracted from the ADX repository and transformed by KQL. The ETL process generates 'nodes' from components, products and customers and 'edges' from transactions and the parent-child information of the BOM.

3.2.1 Source tables

Source tables and their content:
Name | Content
graph_bom | Contains products and components of the BOM (parent-child hierarchy)
graph_tran | Explains the relationship between customer and product

Table 'graph_bom'

This table contains the BOM.
Fig.3: Bill of Material (BOM)
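If you want to inspect the column layout of a source table, the built-in getschema operator helps:

graph_bom
| getschema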

Table 'graph_tran'

This table contains information about transactions.
Fig.4: Transactions

3.3. KQL code for a complete query

The following shows the KQL code for a complete query.
Section '1. Transformation: transactional layout -> graph layout' of the KQL code shows how the data from the ADX repository is prepared for graph processing.
Section '2. Graph query' of the KQL code shows how graph queries are submitted using make-graph and graph-match. (Detailed explanations will follow later)
// 1. Transformation: transactional layout -> graph-Layout: 
let nodes_bom = 
    graph_bom 
    | where Type == 'Product' or not(ParentId == '')
    | project 
        id     = toint(Id), 
        name   = Name, 
        type   = Type, 
        cost   = Cost, 
        status = Status;
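// Customer nodes: ids are generated with row_number starting at 100,
// presumably to keep them apart from the BOM ids.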
let nodes_cust = 
    graph_tran
    | distinct CustomerName 
    | project 
        name = CustomerName, 
        type = 'Customer' 
    | serialize 
    | extend 
        id   = toint(row_number(100)), 
        cost = 0; 
let nodes = 
    nodes_bom 
    | union nodes_cust; 
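// Edges from the BOM parent-child relation. ParentId may hold a
// comma-separated list, hence the split and mv-apply below.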
let edges_bom = 
    graph_bom 
    | where not(ParentId == '') 
    | project 
        id = Id, 
        source_node = Name, 
        type = Type, 
        parent_id_string = ParentId
    | extend parent_id = split(parent_id_string,',')
    | mv-apply parent_id on (
        extend destination_node_id = toint(parent_id) ) 
    | lookup nodes on $left.destination_node_id == $right.id 
    | project 
        source_node, 
        destination_node = name;
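// Edges from transactions: each delivery becomes a product --> customer edge.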
let edges_tran =  
    graph_tran 
    | project 
        source_node      = ProductName, 
        destination_node = CustomerName, 
        tran_date        = TranDate;
let edges = 
    edges_bom 
    | union edges_tran; 
// 2. Graph query: 
edges 
| make-graph source_node --> destination_node with nodes on name 
| graph-match (comp) -[edge_comp_prod*1..5]-> (prod) -[edge_prod_cust*1..5]-> (cust)
  where 
    prod.name == 'PROD-A'
    and cust.name == 'CUST-A'
  project 
    customer = cust.name, 
    product = prod.name, 
    component = comp.name,
    path = map(edge_comp_prod, destination_node)

3.4. Section 1: details

3.4.1. Node data extraction

Three steps - each with a dedicated variable - are necessary for node data extraction:
  • let nodes_bom ... . Nodes are derived from the entities product and component of the BOM
  • let nodes_cust ... . Nodes are derived from the entity customer of the transaction data
  • let nodes = ... union ... . nodes_bom and nodes_cust are merged together
Please note the following technical aspect: the expression let nodes_cust ... uses distinct. Since a single customer could have received multiple products, multiple instances of the customer occur in the transaction data. In order to reduce those instances to a single instance per customer, distinct must be applied.
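A minimal sketch of the effect, with hypothetical duplicate rows:

datatable(CustomerName:string) [
    'CUST-A', 'CUST-A', 'CUST-B'
]
| distinct CustomerName
// returns two rows: 'CUST-A' and 'CUST-B'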

3.4.2. Extract edges

Edge data is extracted in three steps as well:
  • let edges_bom ... . Derive edges from the parent-child relationship of the BOM
  • let edges_tran ... . Derive edges from the transactions ('CUST-A' has received 'PROD-A' ... )
  • let edges = ... union ... . 'edges_bom' and 'edges_tran' are merged together.
Please note the following technical aspect: mv-apply is used inside let edges_bom = ... . Please have a look at the following:
The column ParentId contains the string '11, 2' that is to be understood as a comma-separated list. Value 11 points to the component with Id 11, value 2 points to the component with Id 2. That means: node 'COMP-A' is a child of two other nodes. Therefore it is necessary to derive two rows from the single one. This is done by mv-apply:
Id | Name | Type | ParentId | Cost | Status
4 | COMP-A | Component | 11, 2 | 10,000 | SCHEDULED_FOR_MAINTENANCE

Id | Name | Type | ParentId | Cost | Status
4 | COMP-A | Component | 11 | 10,000 | SCHEDULED_FOR_MAINTENANCE
4 | COMP-A | Component | 2 | 10,000 | SCHEDULED_FOR_MAINTENANCE
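A self-contained sketch of that expansion, with a trim added to guard against the blank after the comma:

datatable(Id:int, Name:string, ParentId:string) [
    4, 'COMP-A', '11, 2'
]
| extend parent_id = split(ParentId, ',')
| mv-apply parent_id on (
    extend destination_node_id = toint(trim(' ', tostring(parent_id)))
)
// yields two rows: destination_node_id 11 and destination_node_id 2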

4. Processing graph data with KQL

In order to get at the data in a graph-like way you could develop your own custom graph engine, e.g. implement the graph business logic with stored procedures and recursive SQL. A custom solution may be a bigger challenge than expected:
  • Recursion and iteration may affect performance (latency due to the many round trips between cloud and custom engine)
  • Recursion causes complexity
  • Without a graph model, the existence of multiple different entities (customer, product and component) may cause a schema boundary that is difficult to overcome
The KQL graph engine provides a solution for the tasks mentioned above. The tasks are solved in an optimized way (use of the adjacency matrix), and you can take advantage of the additional semantic layer. Relationships between nodes are presented in an easy-to-use fashion and can be used intuitively. In most cases only a few lines of code are necessary in order to effectively access the graph object. Examples follow below.

5. Example 1 - KQL query

The goal of make-graph is to import nodes and edges. Conceptually the graph engine delivers a single graph object covering all imported nodes and edges. graph-match restricts the data and controls some details of the execution.
The following code sample is part of the complete KQL query presented earlier. The goal of the query is to get information about the delivery of product 'PROD-A' to 'CUST-A', including all components of 'PROD-A'. Additionally the query shows the path from the current component up to 'CUST-A'.

KQL 'graph-match' expression:

... 
edges 
| make-graph source_node --> destination_node with nodes on name 
| graph-match (comp) -[edge_comp_prod*1..5]-> (prod) -[edge_prod_cust*1..5]-> (cust)
  where 
    prod.name == 'PROD-A'
    and cust.name == 'CUST-A'
  project 
    customer = cust.name, 
    product = prod.name, 
    component = comp.name,
    path = map(edge_comp_prod, destination_node)

Output:

Fig.5: Path product-component-customer
From a conceptual point of view 'graph-match' expresses the use case.

6. Syntax of 'graph-match'

make-graph defines 1. which column of the input table [edges] is used as the source-node column and 2. which column of the input table [edges] is used as the destination-node column. The expression source_node --> destination_node means that values from column [edges].[source_node] are treated as identifiers for source nodes, and values from column [edges].[destination_node] are treated as identifiers for destination nodes.
The expression nodes on name defines that column [nodes].[name] is a lookup column. The identifiers from [edges].[source_node] and [edges].[destination_node] point to that lookup column.
There are several variables that are declared within graph-match. The declaration is done implicitly, in a lambda-expression manner:
  • Variables for nodes: comp, prod, cust
  • Variables for edges: edge_comp_prod, edge_prod_cust
The environment that calls graph-match is conceptually a loop: graph-match is called row by row, and each time the data from the current row is passed to the variables.
The expression -[...]-> represents an edge. Because the edge points from the left side to the right side, source nodes reside on the left side and destination nodes reside on the right side. Since the expression (comp) -[edge_comp_prod*1..5]-> (prod) contains the variable prod on the right side, it is filled with data from destination nodes. And because the expression (prod) -[edge_prod_cust*1..5]-> (cust) contains the variable prod on the left side, it is filled with data from source nodes.
Both node and edge may have properties. Properties derive from the underlying columns [nodes].* or [edges].*. E.g. the expression cust.name gets its data from column [nodes].[name].
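To see the syntax in isolation, here is a minimal, self-contained sketch with a hypothetical three-node graph:

let mini_nodes = datatable(name:string, type:string) [
    'COMP-A', 'Component',
    'PROD-A', 'Product',
    'CUST-A', 'Customer'
];
let mini_edges = datatable(source_node:string, destination_node:string) [
    'COMP-A', 'PROD-A',
    'PROD-A', 'CUST-A'
];
mini_edges
| make-graph source_node --> destination_node with mini_nodes on name
| graph-match (comp) -[e*1..2]-> (cust)
  where comp.type == 'Component' and cust.type == 'Customer'
  project component = comp.name, customer = cust.name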

7. Example 2 - KQL query

The following example shows how KQL can help to prepare a maintenance campaign. Customers that have received products must be informed about upcoming maintenance. The query returns customer, products, components and costs.
graph-match (comp) -[comp_prod*1..5]-> (prod) -[tran*1..5]-> (cust) 
  where 
    comp.status != ""
    and all(tran, getmonth(tran_date) == 5)
  project 
    customer  = cust.name, 
    product   = prod.name, 
    component = comp.name, 
    cost      = comp.cost

Output:

Fig.6: Node and edge properties

8. Visualization

Visualization in MS Fabric and ADX can be achieved in three ways:
  • Visualization through the integrated features of the respective service. The visualization process is triggered by the render operator
  • Visualization through the KQL client. E.g. Power BI provides a KQL client driver and is able to submit KQL queries. At the moment Power BI has no built-in graph visual, so KQL results must be displayed with the help of a table visual, line chart visual etc. The only product with integrated graph capabilities is Azure Monitor
  • Visualization can also be implemented by custom solutions. E.g. the graphics of this article were made with Python pyvis
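As a minimal sketch of the first option, a render call against graph_bom (the chosen aggregation is hypothetical):

graph_bom
| summarize node_count = count() by Type
| render piechart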

9. Python - KQL query and visualization

A custom python app can do the following steps:
  • Install networkx and pyvis
  • Access data on Fabric, ADX or another KQL service
  • Run a KQL query
  • Receive the result and visualize it

Python: KQL query:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table

KUSTO_CLUSTER = '<place_your_cluster_uri_here>'
KCSB = KustoConnectionStringBuilder.with_aad_device_authentication(
        KUSTO_CLUSTER)
KCSB.authority_id = '<place_your_tenant_here>'
KUSTO_CLIENT = KustoClient(KCSB)
KUSTO_QUERY = '<place_your_query_here>'
RESPONSE = KUSTO_CLIENT.execute('ClassroomDB', KUSTO_QUERY)
df = dataframe_from_result_table(RESPONSE.primary_results[0])

Python: Visualization with 'pyvis':

from pyvis import network as net

# Build a directed graph from the query result. This sketch assumes
# df contains the columns source_node and destination_node.
graph = net.Network(directed=True)
for row in df.itertuples():
    graph.add_node(row.source_node)
    graph.add_node(row.destination_node)
    graph.add_edge(row.source_node, row.destination_node)
graph.show('graph.html')