How Graph Neural Networks and Code Property Graphs Analyze True Data Flow to Eliminate SAST False Positives

Static Application Security Testing (SAST) has long been a cornerstone of secure software development. By analyzing source code without executing it, SAST tools can identify vulnerabilities early in the development lifecycle. However, traditional SAST solutions are notorious for producing a large number of false positives—warnings that flag code as vulnerable even when it is safe in real execution contexts. These false positives slow down development, erode trust in security tools, and distract engineers from real risks.

Recent advances in program analysis and machine learning have introduced a powerful combination: Code Property Graphs (CPGs) and Graph Neural Networks (GNNs). Together, they enable a far more precise understanding of true data flow—how data actually moves through a program across functions, conditions, and transformations. This article explores how these technologies work together to drastically reduce SAST false positives, with practical coding examples to illustrate the difference between naive static analysis and graph-based reasoning.

The Root Cause of False Positives in Traditional SAST

Most legacy SAST tools rely on rule-based pattern matching and simplified data-flow heuristics. While effective at detecting obvious issues, these techniques struggle with real-world code complexity.

Consider a common example involving user input and SQL queries:

String userInput = request.getParameter("id");
String query = "SELECT * FROM users WHERE id = " + userInput;
executeQuery(query);

A traditional SAST tool will correctly flag this as a potential SQL injection vulnerability. However, the problem arises when the code becomes slightly more complex:

String userInput = request.getParameter("id");
int safeId = Integer.parseInt(userInput);
String query = "SELECT * FROM users WHERE id = " + safeId;
executeQuery(query);

Many SAST tools still flag this code because they track tainted variables syntactically rather than semantically. They see data originating from an untrusted source and flowing into a sensitive sink, but they fail to account for what Integer.parseInt() guarantees: it either returns an int or throws a NumberFormatException, so the value concatenated into the query can only be a number and cannot carry injected SQL.

This lack of semantic and contextual awareness is the fundamental reason for SAST false positives.
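The difference between the two behaviors can be sketched with a toy taint tracker. This is a minimal illustration, not how any particular SAST product works: the three-address PROGRAM encoding, the operation names, and the NEUTRALIZERS set are all assumptions made for the example.

```python
# Toy three-address program mirroring the Java snippet above.
# Each statement is (target, operation, arguments); all names are illustrative.
PROGRAM = [
    ("userInput", "source", []),             # request.getParameter("id")
    ("safeId", "parse_int", ["userInput"]),  # Integer.parseInt(userInput)
    ("query", "concat", ["safeId"]),         # "SELECT ... " + safeId
    (None, "sink", ["query"]),               # executeQuery(query)
]

# Assumption: operations whose result can no longer carry injected SQL.
NEUTRALIZERS = {"parse_int"}


def find_tainted_sinks(program, neutralizers=frozenset()):
    """Return the arguments that are still tainted when they reach a sink."""
    tainted = set()
    findings = []
    for target, op, args in program:
        if op == "source":
            tainted.add(target)
        elif op == "sink":
            findings.extend(a for a in args if a in tainted)
        elif op in neutralizers:
            tainted.discard(target)  # result is constrained, so taint stops here
        elif any(a in tainted for a in args):
            tainted.add(target)      # taint propagates through the operation
        else:
            tainted.discard(target)  # overwritten with untainted data
    return findings


naive = find_tainted_sinks(PROGRAM)                # ignores what parse_int guarantees
aware = find_tainted_sinks(PROGRAM, NEUTRALIZERS)  # models the numeric constraint
print(naive, aware)  # ['query'] []
```

The naive run reports the sink because taint is copied from variable to variable without regard to what each operation does. The second run reaches the same sink with no finding, purely because one operation is modeled as a constraint.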

What Is True Data Flow?

True data flow refers to the actual runtime behavior of data as it propagates through a program, rather than a simplified approximation based on syntax alone. It accounts for:

Data transformations and sanitization
Control-flow conditions
Interprocedural function calls
Aliasing and object references
Type constraints and implicit guarantees

True data flow answers questions like:

Is this value still attacker-controlled at this point?
Has the data been validated, constrained, or overwritten?
Does this execution path even reach the vulnerable sink?

Capturing these realities requires a richer representation of code than simple abstract syntax trees or token streams.

Code Property Graphs: A Unified Semantic Representation

A Code Property Graph (CPG) is a graph-based representation that unifies multiple views of a program into a single structure. Instead of analyzing syntax, control flow, and data flow separately, a CPG merges them into one graph.

A CPG typically combines:

Abstract Syntax Tree (AST): The syntactic structure of the code
Control Flow Graph (CFG): Possible execution paths
Program Dependence Graph (PDG): Data and control dependencies

Each node represents a program element (variables, expressions, method calls), and edges represent relationships such as “flows into,” “controls,” or “depends on.”

For example, in the following code:

def process(user_input):
    if user_input.isdigit():
        value = int(user_input)
        return value * 2
    return 0

The CPG captures:

The conditional guard (isdigit())
The conversion (int(user_input))
The dependency between the condition and the assignment
The fact that value is only defined on a safe execution path

This unified graph allows security analysis to reason about whether unsafe data can actually reach sensitive operations.
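To make the idea concrete, here is a rough sketch that builds a very small CPG-like graph for the process function using Python's standard ast module. It is a simplification under stated assumptions: the edge labels (AST, CONTROLS, FLOWS_INTO) and the build_toy_cpg helper are invented for this example, only the if-body is treated as control dependent, and data flow is approximated by linking each variable use to its most recent definition in visit order. Production CPG builders do considerably more.

```python
import ast
from collections import Counter

SOURCE = '''
def process(user_input):
    if user_input.isdigit():
        value = int(user_input)
        return value * 2
    return 0
'''

# Context and operator markers (Load, Store, Mult, ...) add no structure here.
SKIP = (ast.expr_context, ast.operator, ast.unaryop, ast.boolop, ast.cmpop)


def children(node):
    return [c for c in ast.iter_child_nodes(node) if not isinstance(c, SKIP)]


def build_toy_cpg(source):
    """Return (labels, edges); edges are (src_id, dst_id, kind) triples."""
    tree = ast.parse(source)
    ids, labels, edges = {}, [], []
    last_def = {}  # variable name -> node id of its latest definition

    def node_id(node):
        if id(node) not in ids:
            ids[id(node)] = len(labels)
            labels.append(type(node).__name__)
        return ids[id(node)]

    def visit(node):
        me = node_id(node)
        # Control dependence: the `if` test controls the statements in its body.
        if isinstance(node, ast.If):
            for stmt in node.body:
                edges.append((node_id(node.test), node_id(stmt), "CONTROLS"))
        # Data flow: latest definition of a name -> each later use of it.
        if isinstance(node, ast.arg):
            last_def[node.arg] = me
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                last_def[node.id] = me
            elif node.id in last_def:
                edges.append((last_def[node.id], me, "FLOWS_INTO"))
        # Syntax: parent -> child.
        for child in children(node):
            edges.append((me, node_id(child), "AST"))
            visit(child)

    visit(tree)
    return labels, edges


labels, edges = build_toy_cpg(SOURCE)
edge_counts = Counter(kind for _, _, kind in edges)
print(edge_counts["CONTROLS"], edge_counts["FLOWS_INTO"])  # 2 3
```

The two CONTROLS edges tie the assignment and the first return to the isdigit() test, and the three FLOWS_INTO edges connect user_input and value to their uses. Because all three edge kinds live in one graph, a single traversal can ask a question that spans syntax, control flow, and data flow at once.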

How Code Property Graphs Improve SAST Precision

Using CPGs, SAST tools can move from shallow pattern matching to structural reasoning.

Consider an example that is prone to false positives:

function saveFile(filename) {
    if (!filename.includes("..")) {
        fs.writeFileSync("/safe/dir/" + filename, "data");
    }
}

A naive SAST rule might flag this as a path traversal vulnerability because user-controlled input is used in a file path. However, a CPG-based analysis can see:

The control dependency on the if condition
The validation logic that rejects filenames containing ".."
The fact that the sink is only reachable on validated paths

By modeling these relationships explicitly, CPGs eliminate many false positives caused by ignoring control flow and validation logic.
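A hedged sketch of the control-dependency part of that reasoning follows, written against a Python analogue of the function above because Python's ast module makes the traversal short. The SINKS set, the guarded_sinks helper, and both sample functions are assumptions for illustration. The check only establishes that a sink sits under a condition mentioning the same variable; it does not judge whether that condition is an adequate defense.

```python
import ast

SOURCE = '''
def save_file(filename):
    if ".." not in filename:
        open("/safe/dir/" + filename, "w").write("data")

def save_file_unchecked(filename):
    open("/safe/dir/" + filename, "w").write("data")
'''

SINKS = {"open"}  # assumption: calls treated as file-system sinks


def names_in(node):
    """Collect every variable name mentioned inside an AST subtree."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}


def guarded_sinks(source):
    """Map each function name to a list of (sink, is_guarded) pairs."""
    results = {}

    def visit(node, guards, func):
        if isinstance(node, ast.FunctionDef):
            func = node.name
            results.setdefault(func, [])
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in SINKS):
            used = set().union(*(names_in(arg) for arg in node.args))
            # Guarded if some enclosing `if` test mentions a variable the sink uses.
            guarded = any(used & names_in(test) for test in guards)
            results.setdefault(func, []).append((node.func.id, guarded))
        if isinstance(node, ast.If):
            visit(node.test, guards, func)
            for stmt in node.body:
                visit(stmt, guards + [node.test], func)  # body depends on the test
            for stmt in node.orelse:
                visit(stmt, guards, func)
            return
        for child in ast.iter_child_nodes(node):
            visit(child, guards, func)

    visit(ast.parse(source), [], None)
    return results


report = guarded_sinks(SOURCE)
print(report)
```

For the two sample functions, the first sink is reported as guarded and the second as unguarded. A rule that looks only at "input reaches open()" cannot tell them apart.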

Why Graph Neural Networks Are a Natural Fit

While CPGs provide rich structural data, analyzing them at scale is challenging. Real-world programs produce massive graphs with thousands or millions of nodes. This is where Graph Neural Networks (GNNs) come in.

GNNs are machine learning models designed to learn patterns directly from graph structures. Unlike traditional neural networks, GNNs propagate information along edges, allowing each node to learn from its neighbors.

In the context of security analysis, GNNs can learn:

What safe data flows look like
How sanitization functions affect taint propagation
Which patterns consistently lead to real vulnerabilities

Rather than relying on hand-written rules, GNNs learn these behaviors from labeled examples of vulnerable and non-vulnerable code.
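The propagation step can be shown in a few lines of plain Python. This sketch is deliberately stripped down: it averages feature vectors and omits the learned weight matrices and nonlinearities that a real GNN layer applies, and the three-feature encoding and node names are assumptions chosen to match the earlier SQL example.

```python
def message_passing_round(features, edges):
    """One round: each node averages its own vector with its in-neighbours' vectors."""
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(src)
    updated = {}
    for node, vec in features.items():
        vectors = [features[n] for n in incoming[node]] + [vec]
        updated[node] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return updated


# Feature layout (illustrative): [is_source, is_sanitizer, is_sink]
features = {
    "getParameter": [1.0, 0.0, 0.0],
    "parseInt":     [0.0, 1.0, 0.0],
    "executeQuery": [0.0, 0.0, 1.0],
}
# Data-flow edges: getParameter -> parseInt -> executeQuery
edges = [("getParameter", "parseInt"), ("parseInt", "executeQuery")]

h1 = message_passing_round(features, edges)
h2 = message_passing_round(h1, edges)
print(h1["executeQuery"])  # [0.0, 0.5, 0.5]
print(h2["executeQuery"])  # [0.25, 0.5, 0.25]
```

After one round the sink node knows it is fed by a sanitizer. After two rounds it also carries a trace of the taint source two hops upstream. Stacking rounds is how a node's representation comes to summarize a growing neighborhood of the graph.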

Learning True Data Flow with GNNs

When a GNN is applied to a CPG, each node (such as a variable or function call) starts with an embedding representing its properties: type, role, source, or sink classification. Through message passing, nodes exchange information with neighbors.

For example, consider this Java code:

String input = request.getParameter("age");
int age = Integer.parseInt(input);
if (age > 18) {
    grantAccess(age);
}

A GNN can learn that:

request.getParameter() is a taint source
Integer.parseInt() constrains input
Numeric comparison narrows the range of possible values further
The sink (grantAccess) receives an integer greater than 18 rather than raw attacker-controlled text

Instead of blindly propagating taint, the GNN learns when taint is effectively neutralized based on structural and semantic cues in the graph.
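The idea of learning where taint stops, rather than hard-coding it, can be illustrated with a much smaller model than a GNN. The sketch below is a stand-in, not a GNN: it assigns each operation one learnable "gate" between 0 and 1, scores a flow as the product of the gates along it, and fits the gates by gradient descent on a handful of labeled flows. The operation names, the labels, and the training setup are all assumptions invented for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Labeled flows from a source to a sink: (operations on the path, 1.0 = vulnerable).
PATHS = [
    (["concat"], 1.0),
    (["parse_int", "concat"], 0.0),
    (["trim", "concat"], 1.0),
    (["parse_int"], 0.0),
    (["trim"], 1.0),
]


def predict(weights, ops):
    """Risk of a flow = product of per-operation gates (1 passes taint, 0 blocks it)."""
    risk = 1.0
    for op in ops:
        risk *= sigmoid(weights[op])
    return risk


def train(paths, epochs=300, lr=1.0):
    weights = {op: 0.0 for ops, _ in paths for op in ops}  # every gate starts at 0.5
    for _ in range(epochs):
        for ops, label in paths:
            risk = predict(weights, ops)
            error = risk - label
            # Squared-error gradient; d(risk)/d(weight) = risk * (1 - gate).
            # Assumes each operation appears at most once per path.
            grads = {op: 2.0 * error * risk * (1.0 - sigmoid(weights[op])) for op in ops}
            for op, grad in grads.items():
                weights[op] -= lr * grad
    return weights


weights = train(PATHS)
gates = {op: round(sigmoid(w), 3) for op, w in weights.items()}
print(gates)
```

Nothing in the code says that parse_int is a sanitizer. Its gate is driven toward 0 only because every labeled flow that passes through it is safe, while the gates for concat and trim move toward 1. A GNN over a CPG applies the same principle with far richer inputs: node types, edge types, and surrounding structure instead of a bare operation name.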

Reducing False Positives in Interprocedural Analysis

One of the hardest problems in SAST is interprocedural analysis: tracking data flow across multiple functions and files. Consider a C example in which sanitization lives in a separate helper function:

char* sanitize(char* input) {
    remove_special_chars(input);
    return input;
}

void handle(char* userInput) {
    char* safe = sanitize(userInput);
    executeCommand(safe);
}

Traditional SAST tools often struggle here because they either:

Do not inline function behavior accurately, or
Treat custom sanitization functions as unknown

With CPGs, the sanitize function is represented as a subgraph. A GNN can learn from training data that certain transformations reduce risk, even when the function name is not on a predefined whitelist.

This allows the system to generalize beyond known sanitizers and significantly reduce false positives in large, modular codebases.
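The subgraph view can be sketched as follows. Each function contributes its own data-flow edges, and call sites are stitched in by linking arguments to parameters and return values to call results. The node names, the FUNCTION_GRAPHS and CALL_SITES structures, and the choice to treat remove_special_chars as a taint-blocking node are assumptions for illustration; in the approach described above, that last judgment would come from a trained model rather than a hand-written set.

```python
from collections import deque

# Hand-written stand-ins for the per-function subgraphs a CPG builder would produce.
FUNCTION_GRAPHS = {
    "sanitize": [
        ("sanitize.input", "remove_special_chars"),
        ("remove_special_chars", "sanitize.return"),
    ],
    "handle": [
        ("handle.userInput", "handle.call_arg"),
        ("handle.call_result", "handle.safe"),
        ("handle.safe", "executeCommand"),
    ],
}

# Call sites: (argument node, callee parameter, callee return, call-result node).
CALL_SITES = [
    ("handle.call_arg", "sanitize.input", "sanitize.return", "handle.call_result"),
]


def link(function_graphs, call_sites):
    """Merge per-function edges and add argument/return edges for every call site."""
    edges = [edge for graph in function_graphs.values() for edge in graph]
    for arg_node, param_node, return_node, result_node in call_sites:
        edges.append((arg_node, param_node))      # argument -> parameter
        edges.append((return_node, result_node))  # return value -> call result
    return edges


def reaches_without(edges, source, sink, blocked):
    """Breadth-first search: can data reach the sink while avoiding blocked nodes?"""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for nxt in successors.get(node, []):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return False


edges = link(FUNCTION_GRAPHS, CALL_SITES)
raw_flow = reaches_without(edges, "handle.userInput", "executeCommand", set())
unsanitized_flow = reaches_without(
    edges, "handle.userInput", "executeCommand", {"remove_special_chars"}
)
print(raw_flow, unsanitized_flow)  # True False
```

A flow from userInput to executeCommand exists, but every such flow runs through the body of sanitize. Once the call is linked into the graph, the custom helper is no longer an opaque unknown.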

Context-Aware Vulnerability Classification

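Another advantage of combining CPGs and GNNs is context awareness. The same data can be safe in one context and dangerous in another. Compare a query that passes user_id as a bound parameter (the %s placeholder style depends on the database driver):

query = "SELECT * FROM users WHERE id = %s"
cursor.execute(query, (user_id,))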

Versus:

query = "SELECT * FROM users WHERE id = " + user_id
cursor.execute(query)

A GNN trained on CPGs can learn the subtle structural differences:

Parameterized query usage
String concatenation vs. binding
API call semantics

This contextual understanding is extremely difficult to encode with static rules but comes naturally to graph-based learning.
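The structural difference is visible even in a plain syntax tree. The following sketch uses Python's ast module to classify execute calls by how their first argument was built. It is a hand-written heuristic, shown only to make the structural cues tangible: the classify_execute_calls helper and the verdict labels are assumptions, the variable lookup is flow-insensitive, and a learned model would weigh many more signals than these.

```python
import ast


def classify_execute_calls(source):
    """Return a verdict for each `<obj>.execute(...)` call in the source."""
    tree = ast.parse(source)
    # Remember simple `name = <expr>` assignments so a query variable can be resolved.
    assigned = {}
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign) and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)):
            assigned[node.targets[0].id] = node.value

    verdicts = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
                and node.func.attr == "execute" and node.args):
            query = node.args[0]
            if isinstance(query, ast.Name):
                query = assigned.get(query.id, query)
            if isinstance(query, (ast.BinOp, ast.JoinedStr)):
                # `+`, `%` formatting, and f-strings all splice data into the SQL text.
                verdicts.append("string-built")
            elif isinstance(query, ast.Constant) and len(node.args) >= 2:
                verdicts.append("parameterized")
            elif isinstance(query, ast.Constant):
                verdicts.append("constant")
            else:
                verdicts.append("unknown")
    return verdicts


BOUND = 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))'
FORMATTED = 'query = "SELECT * FROM users WHERE id = %s" % user_id\ncursor.execute(query)'
CONCAT = 'query = "SELECT * FROM users WHERE id = " + user_id\ncursor.execute(query)'

for snippet in (BOUND, FORMATTED, CONCAT):
    print(classify_execute_calls(snippet))
```

Note that the middle snippet lands in the same category as concatenation. Applying the % operator to the string before calling execute produces finished SQL text, so it carries the same injection risk even though the query contains a %s placeholder. Only passing the values as a separate argument leaves the binding to the driver.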

Practical Impact on Developer Productivity

Reducing false positives is not just a theoretical improvement—it has real-world consequences. When developers trust SAST results, they are more likely to:

Fix issues earlier
Integrate security checks into CI pipelines
Respond quickly to real vulnerabilities

By focusing alerts on high-confidence findings derived from true data flow, CPG- and GNN-based systems improve the signal-to-noise ratio of security alerts.

Conclusion

The persistent problem of false positives has long limited the effectiveness of Static Application Security Testing. Traditional SAST tools, constrained by rule-based heuristics and shallow data-flow approximations, often fail to reflect how programs actually behave at runtime. As software systems grow more complex, these limitations become increasingly costly.

The combination of Code Property Graphs and Graph Neural Networks represents a fundamental shift in how static security analysis is performed. Code Property Graphs provide a unified, semantically rich representation of code that captures syntax, control flow, and data dependencies in a single structure. This alone enables more accurate reasoning about how data moves through a program and under what conditions it reaches sensitive operations.

Graph Neural Networks build on this foundation by learning patterns of true data flow directly from real code. Instead of relying on brittle rules, GNNs infer from labeled examples when data has been sufficiently validated, constrained, or transformed. They can pick up on context, interprocedural behavior, and subtle structural cues that distinguish real vulnerabilities from benign code patterns.

Together, these technologies move SAST from a conservative, noisy approximation toward a precise, intelligent analysis of real program behavior. The result is fewer false positives, higher developer trust, and more effective security outcomes. As adoption grows, CPG- and GNN-based analysis is poised to redefine what “static” security testing can achieve—bringing it closer than ever to understanding code the way developers actually write and execute it.