EMFxcel: Having Fun with Excel and Eclipse Modeling Tools

In order to be able to do advanced data processing and visualization with technologies I am familiar with, I created a simple tool to process Microsoft Excel® documents using Eclipse Modeling Framework. The solution consists of an Xcore-based Excel domain, and a slightly modified EMF generated editor that is able to open Excel files directly as EMF instance models via Apache POI, and automatically track and propagate changes as the opened file is being simultaneously modified in Excel. I demonstrate live data processing using EMF-IncQuery graph queries, and live data visualization using IncQuery Viewers.

The source code and sample models that belong to this example is found at https://github.com/ujhelyiz/EMF-IncQuery-Examples/tree/master/emfxcel, and it requires the brand new EMF-IncQuery 0.8 release that can be downloaded from the Eclipse Marketplace (among others).

The problem

In my experience, next to traditional programmer tools such as IDEs and compilers, Microsoft Excel® is one of the most frequently used software development programs. This is because even today, many organizations tend to keep documentation, requirements, and other auxiliary data related to software development processes in Microsoft Office documents, and particularly Excel spreadsheets. Managers love spreadsheets, right? :-)

So, many of us need to work with Excel documents, and sometimes this goes beyond just opening these files and looking at their contents. What if, say, you need to process the contents programmatically, in order to perform (non-trivial) calculations or extract data to feed into some other program? The obvious solution, one might think, is to use Excel's built-in facilities such as macros.

Here is a simple example problem that I had to solve the other day (I modified the contents and schema to simplify things quite a bit). We have a document that consists of multiple worksheets, each of which contains several rows and columns — in our example, students and courses. We want to check if there are any students who applied to multiple courses, and if there are any who applied to none. 

So, what if you are like me, and don't want to learn advanced Excel magic just to be able to do a few such tricks using Excel data? Furthermore, what if you are already familiar with Eclipse Modeling? Isn't there a way to map Excel into an EMF model and use your favourite Eclipse Modeling technologies such as EMF-IncQuery and Xtend to do the processing? Now there is.

The solution

I was initially sure that this is such a common problem that somebody must have already come up with something. It turned out not to be the case, so I rolled my own and it proved to be a very smooth and pleasant experience. Read on below.

Modeling Excel in Xcore

The first step is to create a simple domain model to represent Microsoft Excel documents as EMF instance models. I used the excellent Xcore framework to draft such an Ecore domain model in an easy-to-use textual syntax. See below for the code.

My approach was to build something simple but already powerful and easily extensible. The basic building blocks (Workbook, Worksheet, Row and Cell) are motivated by the concepts of the Apache POI library and I added a bit of redundancy (namely, explicit indices in Cell elements) to make data processing easier later on. Data types are handled by subtypes of Cell such as BooleanCell, StringCell and NumericCell. These few subtypes were enough for my purposes, but other can be easily added if necessary.

Generating the EMF Editor

The next step is to generate an EMF model editor, in order to be able to test drive our domain model. (I know, this could be done using dynamic instances, but I prefer generated editors.) So I added the @Ecore and @GenModel annotations to my Xcore file (see above) so that the necessary code is generated. Once that is done, I ran the generated editor as an Eclipse Application and created some example models similar to the one below.

Overriding the Editor

The final touch was to make sure that the editor can open actual Excel documents and instantiate my domain model with their contents. As stated before, I integrated the Apache POI library and wrote some simple code to do the transformation from POI objects to the corresponding EObject counterparts. See the populateResource() funtion below for details.

My editor contains a further enhancement: it registers a listener on the workspace, so that whenever the opened file is changed from the outside (e.g. by Excel), this change is picked up and the contents of the EMF instance model are repopulated with new data. This will be important later on. See the registerResourceChangeListener() function below for details.

Once these things were done, I was able to open actual .xlsx files with my editor and see their contents. Now on to the fun part.

Model Queries with EMF-IncQuery

Okay, let's do some live data processing on our Excel documents. To solve my problem, I wrote a couple of EMF-IncQuery queries. If you are not familiar with the syntax of the language or how the IncQuery development tools work, I suggest to read the IncQuery Wiki, and check out some other, more detailed examples.

Here, I am making full use of a new IncQuery 0.8 feature called "Query Libraries", which allows to reuse queries defined in a different file (which may be in a different project or bundle, as long as it is available on the classpath of my IncQuery project). So first, I wrote some generic queries that will be helpful to more concisely and simply code queries specific to my problem. See the code below.

In this query library shown in the snippet above, I wrote three helper queries:

  • CellInWorksheet(C: Cell, WorkSheetName, RowNum, ColNum) is useful to identify individual cells by a worksheet name (that you can see on the Excel UI), and the respective row and column indices (all integers starting from 0).
  • NamedColumn(WorkSheetName, ColName, ColNum) is useful to identify a column index as a "named" column, i.e. when the 0th row contains the string substituted for ColName.
  • valueInNamedColumn(wsName, colName, rowNum, value) is useful to "extract" a cell value from a worksheet and named column, indexed by row number (rowNum).

My problem-specific queries demonstrate the usage of the last one (the first two are actually only used internally). See below.

Here, I am now writing domain-specific queries using the generic queries defined in my query library above.

  • student(name, id)teacher(name) and enrolledStudent(courseName, teacherName, studentId) are helper queries that aid in identifying important concepts.
  • didNotApplyForAnyCourse(name) and appliedToMultipleCourses(name) are the queries that finally solve the original problem.

The screenshot below illustrates what I see after loading both the query definitions and my example instance model into the Query Explorer using the green play button.

Visualization with IncQuery Viewers

IncQuery Viewers is an add-on to EMF-IncQuery that is intended to support powerful and automatic data visualization using standard Eclipse UI technologies such as JFace and Zest. The basic idea is that using Viewers, query results can be automatically mapped to UI contents such as list/tree entries or graph nodes and edges. It is important to note that this is live visualization, which means that whenever the underlying model changes, the query results are incrementally recomputed by EMF-IncQuery, and these changes are also incrementally propagated to the UI through Viewers.

To showcase the powerful visualization capabilities of IncQuery Viewers, I added a couple of annotations to my domain-specific queries above.

  • @Item annotations are used to make students and teachers appear as items (list/tree entries and graph nodes),
  • @Edge annotations are used to make relationships computed as query match results appear as graph edges.

The screenshot below demonstrates how the results are automatically rendered by IncQuery Viewers as a 2D graph powered by the Zest framework. The IncQuery Viewers Sandbox view is triggered automatically by right clicking on the queries in the central section of the Query Explorer, and selecting "Reinitialize IncQuery Viewers" from the menu.

Tracking Changes Automatically

OK, on to the final touches: we want to make sure that whenever we edit the document in Excel, the changes are propagated into the Eclipse tool so that query results and visualization contents are automatically updated. We turn on native hooks in order to make Eclipse automatically react to external file changes (since Excel will be running outside of Eclipse, of course). The setting is found under Preferences.

Let's see how this works. The screenshot below shows my Excel documents at the staring point.

This is what I see in my Eclipse tool:

Now let's assume I do some editing in my Excel document (I add a new row to the Courses worksheet), and then push save:

After a few milliseconds, the changes are automatically propagated into Eclipse, and the contents of the Query Explorer and the IncQuery Viewers Sandbox are updated according to my changes.

That's it!