How to Quickly Repair Corrupt Data in jBPM

It had been a rough two weeks for us and our client — Global Bank.

We had worked tirelessly to resolve a nasty jBPM bug, one that was fixed by adding a single line of code.

Don’t you hate those?

Anyway, it looked like we would make it into release 21 (R21), even if only by the skin of our teeth.

So on Tuesday, our team had a mini celebration.

But in just twelve short hours, our plans would be thwarted by a corrupt process byte array.

The Email of Death

The next morning, we received an email from the testing team about another defect.

The email read:

When we try to complete a task (in BPM6) that was created in BPM5, we’re seeing this error:

java.lang.NullPointerException
at org.jbpm.workflow.instance.node.WorkItemNodeInstance.workItemCompleted(WorkItemNodeInstance.java:383) ~[jbpm-flow-6.5.0.Final-redhat-15.jar:6.5.0.Final-redhat-15]
at org.jbpm.workflow.instance.node.WorkItemNodeInstance.signalEvent(WorkItemNodeInstance.java:361) ~[jbpm-flow-6.5.0.Final-redhat-15.jar:6.5.0.Final-redhat-15]
...

Basically, the BPM engine is telling us that it cannot locate the work item for this task. And therefore, it fails to complete it.

A work item represents a unit of work in jBPM. Any time we need to execute work (send an email, call a web service, etc.), a work item is created.

After some digging, we found that the work item id (used to look up the work item) had been corrupted. This work item id is stored inside the process instance byte array.
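
To make that failure concrete, here is a minimal sketch of a lookup by id that comes back empty. StoredWorkItem and WorkItemStore are hypothetical stand-ins invented for illustration; they are not jBPM classes, and the real engine code is more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a stored work item; not the jBPM WorkItem interface.
final class StoredWorkItem {
    final long id;
    final String name;

    StoredWorkItem(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Hypothetical stand-in for a work item store keyed by work item id.
final class WorkItemStore {
    private final Map<Long, StoredWorkItem> items = new HashMap<>();

    void save(StoredWorkItem item) {
        items.put(item.id, item);
    }

    // Returns null when no work item exists for the given id.
    StoredWorkItem find(long workItemId) {
        return items.get(workItemId);
    }

    // Completing a task dereferences whatever the lookup returned.
    String complete(long workItemId) {
        StoredWorkItem item = find(workItemId);
        return "completed " + item.name; // NullPointerException if the lookup found nothing
    }
}
```

With a healthy id the lookup succeeds. With a corrupted id such as -1, nothing is found and the completion step fails with a NullPointerException, which is the same kind of error as in the stack trace above.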

The good news? This only happens with processes started in jBPM 5.

The bad news? Global Bank has hundreds of thousands of those. And if we have this problem, we won’t be able to complete those tasks in production.

Yeah… we were definitely going to miss R21.

But before we could take a breath...

Sooo… when can you guys have this fixed by? R22 goes to DIT in two months.

And just like that, the clock started ticking again.

Process Corruption

Later that week, Red Hat Support explained why this problem happens:

There was a bug in BPM5 that causes the persistence manager to get flushed in the middle of the transaction. This causes a corrupt work item id to get saved.

And with the help of our support team, we came up with a solution.

  1. Detect a corrupted task
    • workItemId == -1 and the node name matches the task we’re completing
  2. Cancel & retrigger the node
    • This forces the engine to recreate the corrupted node with a new work item id and persist it to the DB
  3. Complete the corrupt task without error & continue the process

The numbered comments in the code (#1, #2, #3) map to the steps above:

private UserTaskService userTaskService;
private DeploymentService deploymentService;
private ProcessService processService;

private void completeUserTask(TaskSummary taskSummary){
    ...

    final long pInstanceId = taskSummary.getProcessInstanceId();
    retriggerCorruptedWorkItem(pInstanceId, taskSummary.getName());

    userTaskService.complete(taskSummary.getId(), userId, params); // #3

}

// Finds the corrupted work item node for the given task and retriggers it,
// so the engine recreates the node with a new work item id.
private void retriggerCorruptedWorkItem(final long processInstanceId, final String taskName) {
    processService.execute(deploymentService.getDeploymentId(), new GenericCommand<Void>() {

        private static final long serialVersionUID = 1L;

        @Override
        public Void execute(Context context) {
            KieSession ksession = ((KnowledgeCommandContext) context).getKieSession();

            WorkflowProcessInstance pi =
                    (WorkflowProcessInstance) ksession.getProcessInstance(processInstanceId, true);
            Collection<NodeInstance> instances = pi.getNodeInstances();

            for (NodeInstance ni : instances) {
                if (ni instanceof WorkItemNodeInstance) {
                    WorkItemNodeInstance wi = (WorkItemNodeInstance) ni;

                    // #1: a work item id of -1 on the node we are completing marks corruption
                    if (wi.getWorkItemId() == -1 && wi.getNodeName().equalsIgnoreCase(taskName)) {
                        wi.retrigger(true); // #2: cancel the node and trigger it again
                    }
                }
            }

            return null;
        }
    });
}

** The code above uses the jBPM Services API **

Now, the corrupted task will be completed and a new (duplicate) one will be created. We just needed to test this with multiple corrupted processes.

That’s where the problem lies.

In the database, work item ids are stored in a process instance byte array, a serialized form of the ProcessInstance object. And there’s no SQL query out there that can search inside that.

We needed a way to search the process instance byte arrays of all records in the DB.

And we needed it fast.

Enter Migration Manager

The jBPM engineers developed MigrationManager to migrate active process instances of one process definition to another. Here’s why that mattered to us.

To migrate a process, the MigrationManager API has to manipulate process instance and node data. So not only is it a migration tool, it’s a great starter project to access or repair BPM data as Java objects.

With a few changes, we can transform the out-of-the-box MigrationManager into a...

Corrupt Process Detector

Ok, so the name needs a little work. 

Basically, we changed some code to traverse each process instance and check the work item id. If it’s -1, we print out the process instance id. The code for that is in the download below.

In fact, you can use it to perform any operation on jBPM data.
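
The check itself is small. Here is a rough, self-contained sketch of the scan, assuming the process instances have already been loaded as Java objects. NodeSnapshot, ProcessSnapshot and CorruptProcessDetector are hypothetical stand-ins made up for this post; they are not the MigrationManager or jBPM API, and the real project works against the engine's own classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a node instance that owns a work item.
final class NodeSnapshot {
    final String nodeName;
    final long workItemId;

    NodeSnapshot(String nodeName, long workItemId) {
        this.nodeName = nodeName;
        this.workItemId = workItemId;
    }
}

// Hypothetical stand-in for a process instance read back from its byte array.
final class ProcessSnapshot {
    final long id;
    final List<NodeSnapshot> nodes;

    ProcessSnapshot(long id, List<NodeSnapshot> nodes) {
        this.id = id;
        this.nodes = nodes;
    }
}

final class CorruptProcessDetector {
    // The value we saw on corrupted work item nodes.
    static final long CORRUPT_WORK_ITEM_ID = -1L;

    // Returns the ids of all process instances with at least one corrupted node.
    static List<Long> findCorrupted(List<ProcessSnapshot> processes) {
        List<Long> corrupted = new ArrayList<>();
        for (ProcessSnapshot process : processes) {
            for (NodeSnapshot node : process.nodes) {
                if (node.workItemId == CORRUPT_WORK_ITEM_ID) {
                    corrupted.add(process.id);
                    break; // one corrupted node is enough to flag the process
                }
            }
        }
        return corrupted;
    }
}
```

Printing each id from findCorrupted gives the list of process instances to test the fix against. Returning a list rather than printing inside the loop is just a choice made for this sketch, so the result can be reused.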

Download here

[Credit to Salifou & Olu for this project]

After we found a list of corrupted processes, we tested our fix. And…it worked!

We still missed R21, but we were glad to have a solution well before R22.

But unfortunately, this wasn’t the last time we had to deal with corrupted process instances :/

Happy Coding,

-T.O.