
VIATRA/Query/FAQ


Frequently Asked Questions

Common issues

What does VIATRA Query call the 'qualified name' of a graph pattern?

It is actually the fully qualified name: the qualified name of the containing package, a separator dot, and the simple name of the pattern (the one you define after the pattern keyword). The qualified name of the package consists of dot-separated segments, as in Java. For example, a pattern named fooBar declared in package my.queries has the qualified name my.queries.fooBar.

What is the (computational) complexity of pattern matching?

In our typical use cases, patterns are relatively small and their constraints are quite restrictive, so the size of the match set does not grow combinatorially, as the theoretical worst case would predict.

The total memory usage (of the EMF application plus VIATRA Query) will be the size of the model plus the size of the match set, and actually slightly more than that, because some intermediate results are also cached and incrementally maintained. As of now, we do not carry out any deep query optimization, so in some corner cases these "intermediate results" can grow much bigger than the match set, although usually they do not. A good piece of advice here is to look for small, connected parts of patterns that occur more than once, and refactor them into a separate helper pattern that is called from all the original occurrences using the 'find' keyword. We have summarized best practices on how to avoid such problems on the performance page.

Query evaluation time is pretty much instantaneous. At a low level, the result set has to be copied, but you will usually want to iterate over it anyway, so this will not be the dominant cost. Note that getOneMatch() can be much faster than getAllMatches(). If you have some bound input parameters (as opposed to retrieving all matches globally), the restricted result set is even smaller, and is accessed with a cheap hash-based lookup.
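
To illustrate these access patterns, here is a minimal sketch assuming a matcher class HouseOwnerMatcher generated for a hypothetical two-parameter pattern houseOwner(house, owner); all names are illustrative and follow the API style used elsewhere on this page:

HouseOwnerMatcher matcher = HouseOwnerMatcher.FACTORY.getMatcher(emfRoot);
// cheapest: retrieve a single (arbitrary) match instead of the full set
HouseOwnerMatch anyMatch = matcher.getOneMatch();
// binding an input parameter restricts the match set;
// access is a cheap hash-based lookup into the cache
Collection<HouseOwnerMatch> forThisHouse = matcher.getAllMatches(thisHouse, null);
// full, unrestricted match set (the result collection is copied)
Collection<HouseOwnerMatch> allMatches = matcher.getAllMatches();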

Update time (i.e. model manipulation overhead) is related to how many new matches appear or old ones disappear due to the modification, and also the amount of change in the internal caches and indices. Most typically, a single change in the model only makes a very limited amount of change in the match set and the internal caches, and is therefore cheap.

Initialization time is composed of reading the model and then filling up the caches and indices accordingly. The latter is essentially the per-element update overhead multiplied by the size of the model, as the mechanism is almost the same as the one that maintains the cache afterwards upon model updates. As for model reading, the current version (see below) traverses the entire model once, when the first matcher is constructed for the given pattern and EMF model. Depending on pattern contents, a re-traversal may be required for newly registered patterns -- this behavior will change in the near future for a more flexible approach, whereby the developer will be able to identify batches of patterns that should be initialized together (see also the next question).


How and when are the match set caches initialized?

You can attach a pattern matcher engine to an EMF model root (preferably an EMF ResourceSet or Resource, but potentially any containment subtree) at any time (that is, even before any contents have been loaded). In the current version, at most one pattern matcher engine is built for each of these EMF "roots" (this is true per IncQueryEngine; you can create separate "unmanaged" engines that will not share pattern matcher caches with the default "managed" engine). It is constructed when you first initialize a pattern matcher on that root, and the next time you instantiate a pattern matcher (for the same or a different pattern) on the same root, it will reuse the underlying engine. However, if the second pattern uses model features (e.g. element types) that were irrelevant for the first pattern, then the engine must re-traverse the model to gather additional information.

With optimum performance in mind, you need to consider the following:

  • If you initialize several patterns using various model elements on a (large) model that has already been loaded, there could be many repeated "model read" traversals over the entire ResourceSet (depending on the contents of the patterns). In exchange, VIATRA Query allocates memory only gradually, as you initialize the matchers step by step.
  • In order to avoid the model (re)traversal for initialization, another option is to initialize all VIATRA Query matchers your application will use on a Resource(Set) before its contents have been loaded. This way, pattern matcher initialization incurs no additional traversal overhead; however, all memory is allocated at once.
  • Alternatively, in wildcard mode, the VIATRA Query engine will cache all EObject and reference types during the first model traversal, which means that independently of the contents of your patterns, no further re-traversal will be necessary. This mode is on by default at development time, with an option to opt out in case you need to work with large models. If you want to use wildcard mode at runtime, refer to the API Javadoc.
  • VIATRA Query can now handle groups (batches) of patterns together: a new API feature allows initializing a (freely defined) group of patterns together in one go, without needlessly traversing the model several times (see the sketch after this list). The development environment treats patterns residing in the same .vql file as a group. At runtime, you can compose a PatternGroup however you wish, with built-in support for the group of registered (generated) patterns declared within a single package. The code generator also generates a group from each .vql source file.
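
As a minimal sketch of group-based initialization, assuming a group class MyQueries generated from a hypothetical .vql file and an already obtained engine (names are illustrative):

MyQueries.instance().prepare(engine);
// one shared model traversal prepares all matchers of the group;
// obtaining the individual matchers afterwards is cheap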

Does VIATRA Query need to load all the models into memory, or only the necessary ones like EMF Model Query 2?

VIATRA Query was primarily designed to define and execute queries against models that are already in memory. If you initialize a "matcher" on a ResourceSet that has already been filled with the contents of the model, as addressed in the previous question, VIATRA Query will perform an exhaustive model traversal, and while doing so, it will trigger the loading of any referenced external resources. If the model is changed at a later time to refer to additional external resources, they will be loaded into the ResourceSet as well.

In the development environment, the Query Explorer initializes the pattern matchers for the entire ResourceSet (used by the host editor from which the model is being used). In order to better support working with fragmented models, we support a feature whereby the developer can restrict the matcher initialization to only the main Resource of the host editor by selecting an alternative action for the green button of the Query Explorer. (Note that the API even supports setting the matcher scope to the containment subtree below any EObject.)
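
To illustrate, the matcher scope is determined by the EMF root passed at initialization; a sketch using the hypothetical HouseOwnerMatcher introduced above:

// whole resource set: all loaded Resources are indexed
HouseOwnerMatcher onSet = HouseOwnerMatcher.FACTORY.getMatcher(resourceSet);
// single resource: matching is restricted to its containment tree
HouseOwnerMatcher onResource = HouseOwnerMatcher.FACTORY.getMatcher(resource);
// containment subtree below an arbitrary EObject
HouseOwnerMatcher onSubtree = HouseOwnerMatcher.FACTORY.getMatcher(someEObject);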

In summary, VIATRA Query does not currently concern itself with querying "workspace" models (i.e. models that are not loaded into memory inside some editor, for instance). In this sense, it is complementary to Model Query 2. However, Model Query 2 could be extended to use VIATRA Query-based model indexers, to speed things up considerably. This is a feature that we plan to implement sometime in the future.

What is included in the query results if the matcher is attached to an EMF model root that is not the entire ResourceSet, just some Resource or containment subtree?

Every EObject in the containment subtree below the selected EMF model root will be considered for pattern matching, as well as their attributes and the EReferences interconnecting them. EReferences pointing outward from the subtree, as well as the elements they directly point to, currently may or may not be considered (depending on internal details), so do not assume either case. Nothing else will be considered.

What is included in the query results if the scope of the matcher is the entire ResourceSet?

You do not have to worry about any of the above if the scope is given as the entire ResourceSet - in this case, all EObjects and their attributes and references will be considered, regardless of which Resource of the Set they reside in. If the initially loaded Resources contain references to external Resources, they too will be resolved and loaded within the ResourceSet.

However, there are a few other exceptions discussed in the next point.

What else is included in the query results? Why do I see the contents of some external EPackages, such as Ecore?

If the EMF model root is a ResourceSet, the scope of query evaluation will also include external resources that are referred to from the ResourceSet but not attached to any ResourceSet. (This does not happen normally, as ResourceSets are typically closed w.r.t. references.) A frequent occurrence of this phenomenon is nsURI-based references to metamodel elements in registered EPackages - e.g. from .ecore models. This is why you might see referenced EPackages appearing in the results when you run queries against an .ecore model. In fact, you might see some duplicate EPackages as well, if there are nsURI-based and "platform:"-prefixed references alongside each other - which is the correct result, as these will be separate objects.

How does one use attributes?

Use EAttributes as path expressions in the pattern definition to navigate from the EObject to its attribute value. For example, the constraint House.owner.name(ThisHouse, OwnerName); binds the variable OwnerName to an attribute value (of type java.lang.String, more precisely EString). Afterwards, the raw value can be used directly in a check() condition, in any other pattern constraint, or even as a parameter variable. The equality of two attribute values can be asserted by a '==' constraint between the two variables, such as MyAge == YourAge; or even by using the same variable in both path expressions. For inequality, the operator '!=' is provided; for more complex attribute checks, use a check() expression.

How are null-valued features represented in the query language?

Unset or null-valued attributes (or references) simply won't match, as there is no referenced EObject or attribute value to substitute in the target pattern variable. If you are especially looking for these, use a negative application condition (neg find hasXYZ(...)).

Can patterns be defined recursively?

Theory: the language does not forbid the usage of recursive patterns; however, great care should be taken when using this feature. If the recursion is not well-founded, i.e. matches can circularly support each other, then the result may be different from what you expect (technically, the matcher does not guarantee minimal fixpoint semantics; the fixpoint it stabilizes upon may be non-minimal).

Practice: most of the time, people want to write recursive patterns to evaluate some kind of transitive closure. If you can, just use the built-in transitive closure operator (find myPattern+(A,B)), and then you have nothing to worry about. If your use case is too complex, you can experiment with recursive patterns at your own risk; if the model graph itself is a DAG (acyclic) w.r.t. the edges that your pattern traverses, you should be fine.

Performance optimization guidelines

On this page, we aim to summarize our experiences regarding performance benchmarking with model transformation and model query tools. In particular, we attempt to provide advice on how to do accurate performance benchmarking and avoid typical pitfalls. We also aim to answer frequently asked questions regarding our technologies and performance/scalability/usability/functionality issues. Finally, we provide a detailed list of references to all academic papers, reports and supplementary material that are related to performance/scalability experiments.

Our most important goals with this page are transparency and reproducibility, that is, to provide precise descriptions, code and model examples, and evaluation guidelines that anyone can use to reproduce and check VIATRA and VIATRA Query for performance and scalability.

Basics

The most important configuration step is to ensure that the Java Virtual Machine (JVM) running the Eclipse environment (and VIATRA/VIATRA Query inside) has access to as much memory (RAM) as possible. The reason for this is that by default, the JVM is not configured (by the settings in eclipse.ini) to be able to use all the available RAM in your computer. If the Eclipse application uses up all memory within the - rather low - default limit, thrashing and other kinds of performance degradation might occur, potentially corrupting performance measurement results.

For information on how to specify JVM boot parameters in Eclipse, we refer the reader to the Eclipse documentation on eclipse.ini.

For Eclipse applications, a performance benchmark setup typically requires the appropriate setting of two JVM boot parameters:

  • maximum heap size: -XmxHEAPSIZEm (larger is better)
    • e.g. -Xmx2048m (for a 2GB heap limit)
    • if you wish to use VIATRA Query or VIATRA with large instance models (>100MB in serialized size), specify a limit that is as close to the physical RAM in your computer as possible
  • maximum permgen space: -XX:MaxPermSize=PERMSIZEm
    • e.g. -XX:MaxPermSize=256m (for a 256M permgen space limit)

There are a number of other JVM boot parameters as well, which might have a beneficial effect on overall performance. On 64 bit systems, we recommend the following (a combined eclipse.ini example is shown after the list):

  • -XX:+UseCompressedOops
  • -XX:-UseParallelGC
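
Putting it together, these flags belong in the -vmargs section at the end of eclipse.ini; a typical benchmark configuration might look like this (heap size adjusted to your machine):

-vmargs
-Xmx2048m
-XX:MaxPermSize=256m
-XX:+UseCompressedOops
-XX:-UseParallelGC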

Best practices

In the following, we summarize our recommendations for conducting performance benchmarks. These tips apply not just to VIATRA or VIATRA Query, but to any other (modeling) tool as well.

For query/pattern matching performance, focus your measurements strictly on the query/pattern matching execution phase. In other words, try to avoid including other activities (such as model initialization, the printing of debug/output information to standard output, etc.) in the recorded execution time values. For instance, emitting textual output may have considerable overhead (e.g. as is the case in VIATRA, due to the rather complex formatting/output buffering infrastructure in place to support advanced code generation use cases) that has nothing to do with the (pure) performance of query evaluation/pattern matching.

Measure wall times, preferably with System.nanoTime() or something similar, for maximum accuracy. Whenever possible (especially with open source tools), use source code instrumentation (or simply add a few lines of code to the source) to precisely isolate the execution phases of interest. As observed e.g. in our Train Benchmarks, the specific lifecycle of incremental pattern matching (that is, the overhead on model initialization and modification operations) means that various use cases (such as the "morning boot", i.e. loading the model for the first time, or the "reboot", i.e. the re-execution of queries or transformations after they have been executed previously) may have characteristically different speeds that are best measured separately from each other.

A simple example illustrating this technique with VIATRA Query is as follows:

long start = System.nanoTime();
MatchedClassMatcher matcher = MatchedClassMatcher.FACTORY.getMatcher(emfRoot); 
// initialization phase, the Rete network is constructed (involves model traversal)
long matcherInit = System.nanoTime();    
Collection matches = matcher.getAllMatchesAsSignature();
// pattern matching phase, results are retrieved from the Rete network    
long collectedMatches = System.nanoTime();
System.out.println("Init took: " + (matcherInit-start)/1000000 + 
 " Collecting took: " + (collectedMatches-matcherInit)/1000000 + " ms");

Take the average of at least 10 runs, excluding the worst and best results. Due to frequently encountered auxiliary distortion effects such as OS-level caching or JVM-level class loading, we usually perform several (at least 10) measurement runs, leave out the best and worst results, and take the average of the remaining data. For the Train Benchmarks, we have even taken special care (relying on specific features of the Linux kernel) to disable OS-level caching effects, since the speed of the model loading/initialization phases (especially for very large models) may also significantly depend on such low-level features.
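
A minimal sketch of this measurement scheme follows; runOnce() is a placeholder for the phase being measured:

long[] results = new long[10];
for (int i = 0; i < results.length; i++) {
  long start = System.nanoTime();
  runOnce(); // placeholder: the query or transformation being benchmarked
  results[i] = System.nanoTime() - start;
}
java.util.Arrays.sort(results);
long sum = 0;
for (int i = 1; i < results.length - 1; i++) { // leave out the best and worst run
  sum += results[i];
}
System.out.println("Average: " + (sum / (results.length - 2)) / 1000000 + " ms");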

Take special care when measuring memory overhead. Measuring the memory usage of Java programs is widely known to be a difficult task. For the Rete-based incremental pattern matchers in VIATRA Query and VIATRA, it is relatively straightforward to define the memory overhead as the "retained" (steady-state) memory usage that is registered after a query has been evaluated on an instance model (since Rete maintains an in-memory cache that is kept in sync with the model until it is explicitly disposed or the model itself is disposed).

To measure this, in simple measurements, we commonly use the following code snippet to report the current memory usage of the JVM:

System.gc();
System.gc();
System.gc();
System.gc();
System.gc();
   
try {
  Thread.sleep(1000); // wait for the GC to settle
} catch (InterruptedException e) {
  // TODO handle the exception properly
}
long usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.println("Used memory: " + usedMemory + " bytes");
System.out.println("Used memory: " + (usedMemory/1024)/1024 + " megabytes");

To obtain the overhead due to VIATRA Query, we simply measure the consumption for the case when only the EMF instance model is loaded, and subtract that value from the original measurement.
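
A sketch of this subtraction, assuming a helper measureUsedMemory() that wraps the garbage-collection snippet above (the helper name is illustrative):

long baseline = measureUsedMemory();  // only the EMF instance model is loaded
// ... initialize the VIATRA Query matcher(s) and evaluate the query ...
long withQuery = measureUsedMemory(); // model plus Rete caches
System.out.println("VIATRA Query overhead: " + ((withQuery - baseline)/1024)/1024 + " megabytes");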

In more precise measurements, we use JConsole or a profiler (such as YourKit) to obtain more accurate results. This is also the preferred approach if you are evaluating the transient memory impact of tools (i.e. temporary heap allocations that are released by the garbage collector after the execution has reached a steady state). Note, however, that such heap transients may be very hard to reproduce deterministically, due to i) aliasing effects of the profiler (the transients may be so short-lived that they do not show up on the chart) and ii) inherent non-determinism in the way the JVM works (garbage collection anomalies, or operating system kernel-specific issues).

Optimizing queries and transformations

To optimize VIATRA Query patterns for performance, we recommend keeping to the following simple best practices:

  • Write reusable patterns: factor out commonly used sub-patterns into separate patterns and use find() calls for re-use. This helps clean up your code, and also helps the Rete engine store the matches of commonly used sub-patterns only once (thereby reducing memory consumption). Constraints already expressed in called patterns need not be repeated in the calling pattern.
  • Avoid "Cartesian product" patterns if possible: pattern variables in a pattern should be "connected" to each other via positive constraints (node and edge constraints, positive pattern calls), otherwise all combinations of their potential values must be enumerated and individually checked by the pattern matcher. Note that other constraints (e.g. negative calls, check() expressions) are not suitable for "connecting" the pattern.
  • Simplify check() expressions. Check() expressions may contain additional constraints (typical examples include string operations such as .contains or .startsWith, arithmetic/logical comparisons or equivalence tests, etc) that may include (very) costly operations. In the case of performance issues, it is a good idea to start looking for bottlenecks inside check() expressions and if possible, eliminate them.
  • Linking by edges is good, linking by check() is bad. When expressing the correspondence of two model elements, it is best if you can link them via graph edges, as opposed to comparing their attributes. Alternatively, you can check that two objects have the same attribute value by using the same pattern variable to represent the value, connecting it to both objects via attribute edge constraints. Comparing attributes in check() expressions will always be more expensive than these elegant solutions, since the check will have to be evaluated individually for each potential pair of elements (see the Cartesian product problem above).
  • As a last measure, you may also optimize the Rete layout by manual pattern factorization. To improve the performance of patterns with a large number of constraints, try to identify groups of constraints that "belong together" and factor them out as sub-patterns. For instance, if an expensive operation such as a check() can be evaluated with a subset of a pattern's variables and constraints, those are good candidates to be factored out together.

Typical performance benchmarking aspects

For model transformation tools, a number of performance benchmark experiments have been reported so far (see Pointers below). We briefly summarize some general remarks below.

  • For model simulation scenarios (e.g. petrinet firing, antworld) that measure the time it takes to execute a single simulation step, or a sequence of them, account for randomization and other non-deterministic effects, e.g. by averaging the results.
  • Scenarios involving code generation are sometimes problematic, as output formatting, buffering, file operations, etc. may interfere with your results.

For benchmarking model query tools, we have defined a scenario in our AUTOSAR and Train Benchmarks that aims to simulate the most performance-critical aspects of typical modeling tool use cases. It consists of four phases (a measurement skeleton is shown after the list):

  • Model initialization, measuring the time it takes to load (de-serialize) the model, and, in the case of VIATRA Query or the OCL Impact Analyzer, the overhead of additional cache initialization.
  • First query evaluation, measuring the running time for retrieving the query results for the first time. This case corresponds to batch model validation.
  • Applying model manipulation operations, measuring any overhead on model management (as is the case for VIATRA Query or the OCL Impact Analyzer).
  • Second query (re-)evaluation, measuring the running time for subsequent queries (that are typically much faster for incremental technologies).
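
A skeleton of the four phases, using the timing technique shown earlier (loadModel() and modifyModel() are placeholders for benchmark-specific code, and the matcher class is the one from the example above):

long t0 = System.nanoTime();
ResourceSet model = loadModel();                                // phase 1: model initialization
MatchedClassMatcher matcher = MatchedClassMatcher.FACTORY.getMatcher(model);
long t1 = System.nanoTime();
Collection firstResults = matcher.getAllMatchesAsSignature();   // phase 2: first query evaluation
long t2 = System.nanoTime();
modifyModel(model);                                             // phase 3: model manipulation
long t3 = System.nanoTime();
Collection secondResults = matcher.getAllMatchesAsSignature();  // phase 4: query re-evaluation
long t4 = System.nanoTime();
System.out.println("Phases took (ms): " + (t1-t0)/1000000 + ", " + (t2-t1)/1000000
 + ", " + (t3-t2)/1000000 + ", " + (t4-t3)/1000000);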

Some general recommendations that are worth checking for performance benchmarks:

  • Check for correctness and fairness: do the compared tools return the same results and perform functionally equivalent computations?
  • Minimize auxiliary distortion effects such as OS-level caching, file operations, background task interference, and memory management issues (see the HOWTO section for details).
  • Design measurement parameterization carefully; it is frequently non-trivial to scale up instance model sizes in a way that corresponds to actual modeling practice (threats to validity).
  • If possible, audit your configuration and code with experts of the measured tools.
  • Interpret results carefully. Consider to what extent certain factors (such as model management differences between interpreted and compiled transformation tools) dominate the results, and how certain characteristics of your models/queries/transformations favor certain tools or technologies.
