Analytical Queries on Graph Data and Neural Networks with KQL in Microsoft Fabric and Azure Data Explorer (ADX)

By Lukas Hillesheim, 25 June, 2024
Blog Article & Copyright by Lukas Hillesheim
KQL is a relatively new BI language from Microsoft and is used to retrieve data in the Azure cloud (Azure Data Explorer (ADX) and MS Fabric). This blog article shows how to call the graph engine through KQL and explains the make-graph and graph-match operators. The article is centered on modelling and query tasks that arise in connection with a 'Bill of Material' (BOM).
Data for my examples and graphics: the data for this article was generated by my Python app ndim-graph and uploaded to my database ClassroomDB on ADX. For presentation purposes the data was queried with KQL embedded in ndim-graph and visualized with the Python libraries networkx and pyvis.

Content:

1. Features of the KQL language

Feature | Description
Data wrangling | Features similar to SQL: selection, filtering, join, grouping, aggregation etc.
Flexible data types | KQL data types can handle structured, unstructured and complex data like nested arrays, hash tables and JSON
Integrated visualization | With the render operator, data can be visualized ad hoc (visuals similar to Power BI)
Integrated graph engine | Run queries on nodes and edges
AI libraries | Neural networks, time series, anomaly detection, root cause analysis etc.
Python, R | Integration of Python and R scripts
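As a first impression, here is a minimal sketch that combines wrangling, aggregation and the render operator, run against the table graph_bom introduced later in this article (the aggregation itself is hypothetical and assumes Cost is numeric):

// total cost per status, visualized ad hoc as a column chart
graph_bom
| where Type == 'Component'
| summarize total_cost = sum(Cost) by Status
| render columnchart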

2. Graph theory and neural network

2.1. Bill of Material (BOM) as tree-organized data

A single Bill of Material (BOM) is usually modelled as a tree with a parent-child relationship. The following graphic exposes the hierarchical nature of the entities product and component within the tree:
  • A single product consists of components
  • A single component consists of other components
  • etc.
Leaf nodes can reside on different levels of a tree.
As a consequence of the internal structure of 'PROD-A', for example, the leaf nodes 'COMP-E', 'COMP-B', 'COMP-A' and 'COMP-D' are located on different levels of the tree.
Item levels:
  • 'PROD-A' is located on level 0
  • 'COMP-E' is located on level 1
  • 'COMP-B' is located on level 3 etc.
A tree whose leaf nodes reside on different levels is called unbalanced.
Composition of 'PROD-A':
Fig.1: BOM as tree or graph
A database table can represent the hierarchical structure of a BOM with a parent-child relationship. The two rows show us that 'COMP-A' has a parent component named 'COMP-H':
Id | ... | Name | ParentId
4 | ... | COMP-A | 11
11 | ... | COMP-H | 12
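For illustration, the same two rows as a hypothetical KQL datatable (ParentId is typed as string here because, as shown later, it may hold a comma-separated list):

datatable(Id:int, Name:string, ParentId:string) [
    4, 'COMP-A', '11',
    11, 'COMP-H', '12'
]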
A tree can be considered a special case of a graph.

2.2. Multiple BOM and additional entities form a graph or neural network

Due to customer requests it may be necessary to show the relationship between components and multiple BOMs. Additionally, other entities like customer may be part of the model. In this case the tree model is no longer sufficient and it's time for a graph or neural network.
The following picture shows the relationships between nodes of different entity types (customer, product and component):
  • A customer retrieves multiple products
  • A product consists of multiple components
  • Components consist of other components etc.
Fig.2: Graph or neural network
If you design the data model for a graph, there is something to consider. E.g. 'COMP-A' has a relationship to its parent component 'COMP-H' as well as to another parent 'PROD-B'. With respect to the multiple relationships of 'COMP-A', the table stores two instances of 'COMP-A', each with a different ParentId:
Id | ... | Name | ParentId
2 | ... | PROD-B | NULL
4 | ... | COMP-A | 2
4 | ... | COMP-A | 11
11 | ... | COMP-H | 12

3. Providing graph data

In order to process graph data, two principles should be followed:
  • Provide two entities node and edge
  • The processing of data organized in a graph or neural network is best done by a graph engine

3.1. Paradigm shift through Cloud data center

Migration from an on-premise data center to a cloud-based data center that is designated for all kinds of data and all kinds of workloads (data lake) causes a paradigm shift:
A BI stack traditionally consists of multiple source systems (OLTP), a data warehouse, multiple data marts and multiple cubes. There are two main reasons that drive the separation of processing steps onto different systems:
  • Better performance due to dedicated hardware, dedicated configuration and a dedicated model
  • Each of the dedicated systems exposes an additional semantic layer. E.g. a data warehouse shows the history of data changes. This allows a representation of the state of the data for every single point in time. A data mart shows data categorized as facts and dimensions. This makes it easy for users to retrieve and understand report data.
Each cloud data architect must find the best strategy: whether to work directly on the data lake as mentioned above, or to follow a hybrid pattern where some workloads access dedicated systems (as in the on-premise world).

3.2. KQL Walk-Through

In my scenario I show the in-place approach: graph queries run without a dedicated separate repository for graph data. Graph data is directly extracted from the ADX repository and transformed by KQL. The ETL process generates 'nodes' from components, products and customers and 'edges' from transactions and the parent-child information of the BOM.

3.2.1 Source tables

Source tables and their content:
Name | Content
graph_bom | Contains products and components of the BOM (parent-child hierarchy)
graph_tran | Explains the relationship between customer and product

Table 'graph_bom'

This table contains the BOM.
Fig.3: Bill of Material (BOM)
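If you want to inspect the column layout of a source table, the built-in getschema operator helps:

graph_bom
| getschema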

Table 'graph_tran'

This table contains information about transactions.
Fig.4: Transactions

3.3. KQL code for a complete query

The following shows the KQL code for a complete query.
Section '1. Transformation: transactional layout -> graph layout' of the KQL code shows how the data from the ADX repository is prepared for graph processing.
Section '2. Graph query' of the KQL code shows how graph queries are submitted using make-graph and graph-match. (Detailed explanations will follow later)
// 1. Transformation: transactional layout -> graph-Layout: 
let nodes_bom = 
    graph_bom 
    | where Type == 'Product' or not(ParentId == '')
    | project 
        id     = toint(Id), 
        name   = Name, 
        type   = Type, 
        cost   = Cost, 
        status = Status;
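// Customer nodes: ids are generated with row_number starting at 100,
// presumably to keep them apart from the BOM ids.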
let nodes_cust = 
    graph_tran
    | distinct CustomerName 
    | project 
        name = CustomerName, 
        type = 'Customer' 
    | serialize 
    | extend 
        id   = toint(row_number(100)), 
        cost = 0; 
let nodes = 
    nodes_bom 
    | union nodes_cust; 
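// Edges from the BOM parent-child relation. ParentId may hold a
// comma-separated list, hence the split and mv-apply below.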
let edges_bom = 
    graph_bom 
    | where not(ParentId == '') 
    | project 
        id = Id, 
        source_node = Name, 
        type = Type, 
        parent_id_string = ParentId
    | extend parent_id = split(parent_id_string,',')
    | mv-apply parent_id on (
        extend destination_node_id = toint(parent_id) ) 
    | lookup nodes on $left.destination_node_id == $right.id 
    | project 
        source_node, 
        destination_node = name;
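// Edges from transactions: each delivery becomes a product --> customer edge.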
let edges_tran =  
    graph_tran 
    | project 
        source_node      = ProductName, 
        destination_node = CustomerName, 
        tran_date        = TranDate;
let edges = 
    edges_bom 
    | union edges_tran; 
// 2. Graph query: 
edges 
| make-graph source_node --> destination_node with nodes on name 
| graph-match (comp) -[edge_comp_prod*1..5]-> (prod) -[edge_prod_cust*1..5]-> (cust)
  where 
    prod.name == 'PROD-A'
    and cust.name == 'CUST-A'
  project 
    customer = cust.name, 
    product = prod.name, 
    component = comp.name,
    path = map(edge_comp_prod, destination_node)

3.4. Section 1: details

3.4.1. Node data extraction

Three steps - each with a dedicated variable - are necessary for node data extraction:
  • let nodes_bom ... . Nodes are derived from the entities product and component of the BOM
  • let nodes_cust ... . Nodes are derived from the entity customer of the transaction data
  • let nodes = ... union ... . nodes_bom and nodes_cust are merged together
Please note the following technical aspect: the expression let nodes_cust ... uses distinct. Since a single customer could have received multiple products, multiple instances of the customer occur in the transaction data. In order to reduce those instances to a single instance per customer, distinct must be applied.
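A minimal sketch of the effect, with hypothetical duplicate rows:

datatable(CustomerName:string) [
    'CUST-A', 'CUST-A', 'CUST-B'
]
| distinct CustomerName
// returns two rows: 'CUST-A' and 'CUST-B'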

3.4.2. Extract edges

Edge data is extracted in three steps as well:
  • let edges_bom ... . Derive edges from the parent-child relationship of the BOM
  • let edges_tran ... . Derive edges from the transactions ('CUST-A' has received 'PROD-A' ... )
  • let edges = ... union ... . 'edges_bom' and 'edges_tran' are merged together.
Please note the following technical aspect: mv-apply is used inside let edges_bom = ... . Please have a look at the following:
The column ParentId contains the string '11, 2' that is to be understood as a comma-separated list. Value 11 points to the component with Id 11, value 2 points to the component with Id 2. That means: node 'COMP-A' is a child of two other nodes. Therefore it is necessary to derive two rows from the single one. This is done by mv-apply:
Id | Name | Type | ParentId | Cost | Status
4 | COMP-A | Component | 11, 2 | 10,000 | SCHEDULED_FOR_MAINTENANCE

Id | Name | Type | ParentId | Cost | Status
4 | COMP-A | Component | 11 | 10,000 | SCHEDULED_FOR_MAINTENANCE
4 | COMP-A | Component | 2 | 10,000 | SCHEDULED_FOR_MAINTENANCE
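A self-contained sketch of that expansion, with a trim added to guard against the blank after the comma:

datatable(Id:int, Name:string, ParentId:string) [
    4, 'COMP-A', '11, 2'
]
| extend parent_id = split(ParentId, ',')
| mv-apply parent_id on (
    extend destination_node_id = toint(trim(' ', tostring(parent_id)))
)
// yields two rows: destination_node_id 11 and destination_node_id 2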

4. Processing graph data with KQL

In order to get at the data in a graph-like way you could develop your own custom graph engine, e.g. implement the graph business logic with stored procedures and recursive SQL. A custom solution may be a bigger challenge than expected:
  • Recursion and iteration may affect performance (latency due to the many round trips between cloud and custom engine)
  • Recursion causes complexity
  • Without a graph model, the existence of multiple different entities (customer, product and component) may cause a schema boundary that is difficult to overcome
The KQL graph engine provides a solution for the tasks mentioned above. The tasks are solved in an optimized way (use of the adjacency matrix), and you can take advantage of the additional semantic layer. Relationships between nodes are presented in an easy-to-use fashion and can be used intuitively. In most cases only a few lines of code are necessary in order to effectively access the graph object. Examples follow below.

5. Example 1 - KQL query

The goal of make-graph is to import nodes and edges. Conceptually the graph engine delivers a single graph object covering all imported nodes and edges. graph-match restricts the data and controls some details of the execution.
The following code sample is part of the complete KQL query presented earlier. The goal of the query is to get information about the delivery of product 'PROD-A' to 'CUST-A', including all components of 'PROD-A'. Additionally the query shows the path from the current component up to 'CUST-A'.

KQL 'graph-match' expression:

... 
edges 
| make-graph source_node --> destination_node with nodes on name 
| graph-match (comp) -[edge_comp_prod*1..5]-> (prod) -[edge_prod_cust*1..5]-> (cust)
  where 
    prod.name == 'PROD-A'
    and cust.name == 'CUST-A'
  project 
    customer = cust.name, 
    product = prod.name, 
    component = comp.name,
    path = map(edge_comp_prod, destination_node)

Output:

Fig.5: Path product-component-customer
From a conceptual point of view 'graph-match' expresses the use case.

6. Syntax of 'graph-match'

make-graph defines 1. which column of the input table [edges] is used as the source-node column and 2. which column of the input table [edges] is used as the destination-node column. The expression source_node --> destination_node means that values from column [edges].[source_node] are treated as identifiers for source nodes, and values from column [edges].[destination_node] are treated as identifiers for destination nodes.
The expression nodes on name defines that column [nodes].[name] is a lookup column. The identifiers from [edges].[source_node] and [edges].[destination_node] point to that lookup column.
There are several variables that are declared within graph-match. The declaration is done implicitly, in a lambda-expression manner:
  • Variables for nodes: comp, prod, cust
  • Variables for edges: edge_comp_prod, edge_prod_cust
The environment that calls graph-match is conceptually a loop: graph-match is called row by row, and each time the data from the current row is passed to the variables.
The expression -[...]-> represents an edge. Because the edge points from the left side to the right side, source nodes reside on the left side and destination nodes reside on the right side. Since the expression (comp) -[edge_comp_prod*1..5]-> (prod) contains the variable prod on the right side, it is filled with data from destination nodes. And because the expression (prod) -[edge_prod_cust*1..5]-> (cust) contains the variable prod on the left side, it is filled with data from source nodes.
Both node and edge may have properties. Properties derive from the underlying columns [nodes].* or [edges].*. E.g. the expression cust.name gets its data from column [nodes].[name].
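To see the syntax in isolation, here is a minimal, self-contained sketch with a hypothetical three-node graph:

let mini_nodes = datatable(name:string, type:string) [
    'COMP-A', 'Component',
    'PROD-A', 'Product',
    'CUST-A', 'Customer'
];
let mini_edges = datatable(source_node:string, destination_node:string) [
    'COMP-A', 'PROD-A',
    'PROD-A', 'CUST-A'
];
mini_edges
| make-graph source_node --> destination_node with mini_nodes on name
| graph-match (comp) -[e*1..2]-> (cust)
  where comp.type == 'Component' and cust.type == 'Customer'
  project component = comp.name, customer = cust.name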

7. Example 2 - KQL query

The following example shows how KQL can help to prepare a maintenance campaign. Customers that have received products must be informed about upcoming maintenance. The query returns customer, products, components and costs.
graph-match (comp) -[comp_prod*1..5]-> (prod) -[tran*1..5]-> (cust) 
  where 
    comp.status != ""
    and all(tran, getmonth(tran_date) == 5)
  project 
    customer  = cust.name, 
    product   = prod.name, 
    component = comp.name, 
    cost      = comp.cost

Output:

Fig.6: Node and edge properties

8. Visualization

Visualization in MS Fabric and ADX can be achieved in three ways:
  • Visualization through the integrated features of the respective service. The visualization process is triggered by the render operator
  • Visualization through the KQL client. E.g. Power BI provides a KQL client driver and is able to submit KQL queries. At the moment Power BI has no built-in graph visual, so KQL results must be displayed with the help of a table visual, line chart visual etc. The only product with integrated graph capabilities is Azure Monitor
  • Visualization can also be implemented by custom solutions. E.g. the graphics of this article were made with Python pyvis
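As a minimal sketch of the first option, a render call against graph_bom (the chosen aggregation is hypothetical):

graph_bom
| summarize node_count = count() by Type
| render piechart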

9. Python - KQL query and visualization

A custom python app can do the following steps:
  • Install networkx and pyvis
  • Access data on Fabric, ADX or another KQL service
  • Run a KQL query
  • Receive the result and visualize it

Python: KQL query:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table

KUSTO_CLUSTER = '<place_your_cluster_uri_here>'
KCSB = KustoConnectionStringBuilder.with_aad_device_authentication(
        KUSTO_CLUSTER)
KCSB.authority_id = '<place_your_tenant_here>'
KUSTO_CLIENT = KustoClient(KCSB)
KUSTO_QUERY = '<place_your_query_here>'
RESPONSE = KUSTO_CLIENT.execute('ClassroomDB', KUSTO_QUERY)
df = dataframe_from_result_table(RESPONSE.primary_results[0])

Python: Visualization with 'pyvis':

from pyvis import network as net

# Build a directed graph from the query result. This sketch assumes
# df contains the columns source_node and destination_node.
graph = net.Network(directed=True)
for row in df.itertuples():
    graph.add_node(row.source_node)
    graph.add_node(row.destination_node)
    graph.add_edge(row.source_node, row.destination_node)
graph.show('graph.html')