How One Little Method In java.io.File Cost Us 2 Weeks

A little method in java.io.File made for one hard-to-reproduce error and plenty of lost time.

Last year, we helped Global Bank upgrade from JBPM v5 to v6.

And since JBPM 6 was basically a redesign, there were numerous hurdles in migrating in-flight processes from v5.

But we worked tirelessly with the Global Bank team to resolve each one.

We couldn’t wait to put it into production. It would be a big win for Global Bank and Red Hat.

Unfortunately, we had to delay until the next release: R21.

No problem. Everything would be smoother then. Or so we thought.

R21

We had just finished testing our changes in R21, and now we were ready to deploy to higher environments.

That’s when I got a message from one of the Global Bank developers.

"Hey TO, we’ve got a problem that is preventing us from testing in DIT"

Whenever he tried to complete a task, he’d see this error:

Could not find process x when restoring process instance 123456

The problem is, this bug occurred sporadically on different servers.

But that's ok. Everyone loves an error that's hard to reproduce...right?

Our architect later joined the call to suggest the most basic solution:

"If other servers work and this one doesn't, why don’t you restart this server? Maybe it didn’t come up cleanly.."
"A restart? Are you serious? Come on.."

One Restart Later

“It must have been temporary insanity.."

That was half right.

What we were wrong about was the “temporary” part.

Tick, Tock

“If you don’t fix this by next Wednesday, we’ll need to pull your code out of R21”

The error had appeared on different servers over the next week. And our reboot solution was only working sometimes.

“We have our support team poring over the logs. And we've been on calls during the night and throughout the weekend to resolve this.”

That’s consultant speak for “We haven’t slept in days”.

Anyway, Global Bank is using the embedded JBPM model. So we assumed that something went wrong in the three steps it takes to start a process:

  1. Install the kjar (a packaged artifact of business processes) into the maven repo
  2. Deploy the kjar using DeploymentService
  3. Start a process

But none of our theories had panned out.

Clear out the maven repository! Fail.

Check the permissions on the .m2 folder! No sir

Check the WebSphere temp folder for cached processes from v5! Goose-egg

We were running out of time. That’s when, after scanning through thousands of log messages, we found this gem.

[Process Compilation error. WorkflowProcessInstanceImpl cannot be resolved to a type]

Basically, the system is telling us that one of our process files references a class that is not imported. And if the kjar doesn’t compile, it won’t get deployed. Hence the “Could not find process” error.

It was another theory. It just didn’t make sense because..

If it compiled incorrectly, why would it work sometimes? And why did a reboot fix it sometimes?

But it made perfect sense once we discovered..

One Little Method In java.io.File: listFiles()

listFiles() returns an array of File objects, one for each file and subdirectory in a directory. It’s especially useful for quickly looping through a directory and performing operations on its contents.

But there’s a caveat. listFiles() doesn’t guarantee that the entries come back in any particular order, so the order can differ from one call, server, or restart to the next. Not only does this make the operation faster, but order usually shouldn’t matter.
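If the order does matter to you, the usual workaround is to sort the array yourself before using it. Here’s a minimal sketch of that idea. The class name, the helper name and the p1/p2/p3 .bpmn2 file names are made up for illustration, and this is not how the KIE tooling does it.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Comparator;

class SortedListing {

    // Returns the directory's entries sorted by name, so callers see the same
    // order on every run no matter what order the file system hands back.
    static File[] listFilesSorted(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            // listFiles() returns null if dir is not a directory or an I/O error occurs.
            return new File[0];
        }
        Arrays.sort(entries, Comparator.comparing(File::getName));
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names, created in a throwaway temp directory.
        Path dir = Files.createTempDirectory("kjar-demo");
        for (String name : new String[] {"p3.bpmn2", "p1.bpmn2", "p2.bpmn2"}) {
            Files.createFile(dir.resolve(name));
        }
        for (File f : listFilesSorted(dir.toFile())) {
            System.out.println(f.getName());
        }
    }
}
```

Sorting by name doesn’t fix a missing import, of course. It just means a broken build fails the same way every time, instead of only on some restarts.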

Here’s Why It Mattered to Us

In our kjar, we have three process files. (p1, p2, p3)

p1 and p2 contain an import statement for the missing class above. So if either of these load first, p3 doesn’t throw any errors for WorkflowProcessInstanceImpl.

But in p3, that import statement was missing.

So when p3 loads first, we get a compilation error because the compiler can’t resolve the class name. We get different results depending on the order in which the processes are loaded and compiled.

Guess which method the kie-maven plugin uses to load these assets?

Yep.

And since we compile the kjar at deploy time, the order is different on each restart. We had a 1/3 chance of seeing this error every time.
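To see where the 1/3 comes from, here’s a toy model of the behavior described above. It is only a sketch: it assumes every load order is equally likely, and it assumes a file compiles if it carries the import itself or if an earlier file in the same build already supplied it. The class and method names are made up, and none of this is JBPM code.

```java
import java.util.ArrayList;
import java.util.List;

class LoadOrderOdds {

    // Toy rule: a file compiles if it has the import itself, or if an earlier
    // file in the same build already brought the import in.
    static boolean buildSucceeds(List<String> order, List<String> filesWithImport) {
        boolean importSeen = false;
        for (String file : order) {
            if (filesWithImport.contains(file)) {
                importSeen = true;
            } else if (!importSeen) {
                return false; // this file needed the import and nothing supplied it yet
            }
        }
        return true;
    }

    // Collects every possible ordering of the remaining items into out.
    static void permute(List<String> remaining, List<String> prefix, List<List<String>> out) {
        if (remaining.isEmpty()) {
            out.add(new ArrayList<>(prefix));
            return;
        }
        for (int i = 0; i < remaining.size(); i++) {
            List<String> rest = new ArrayList<>(remaining);
            String chosen = rest.remove(i);
            prefix.add(chosen);
            permute(rest, prefix, out);
            prefix.remove(prefix.size() - 1);
        }
    }

    // Counts the orderings of p1, p2, p3 in which the build fails.
    static int countFailingOrders() {
        List<List<String>> orders = new ArrayList<>();
        permute(List.of("p1", "p2", "p3"), new ArrayList<>(), orders);
        int failures = 0;
        for (List<String> order : orders) {
            if (!buildSucceeds(order, List.of("p1", "p2"))) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(countFailingOrders() + " of 6 orderings fail");
    }
}
```

Only the two orderings that start with p3 fail, and 2 out of 6 is our 1/3.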

Lessons Learned

“We have found the root cause and implemented a solution”

That’s consultant speak for..

It was Tuesday afternoon, and we had made the deadline by the skin of our teeth.

And even though we resolved the problem, there are two things we could have done better.

  1. Enable the kie-maven-plugin in Jenkins - Due to customer environment issues, we weren’t able to enable this plugin yet. We had put it as a low-priority item for later & never came back to it. The plugin wouldn’t have solved the problem. But since it precompiles the workflow files, it would ensure that only correctly compiled kjars got deployed.
  2. Better exception handling - Since Global Bank’s entire application depends on JBPM, any compilation errors in the deployment should kill the application deploy (throw a RuntimeException). This way, no server would have been able to start with this error. Swallowed errors swallow problems.
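The second point could look something like the sketch below. It assumes the build step hands back its compilation errors as a plain list of messages. That list, the class name and the method name are all stand-ins for illustration, not the real KIE API or what Global Bank ended up with.

```java
import java.util.List;

class FailFastDeploy {

    // Hypothetical guard: call this with whatever compilation errors the build
    // step reported, before any process gets deployed.
    static void requireCleanBuild(List<String> compilationErrors) {
        if (!compilationErrors.isEmpty()) {
            // An unchecked exception here stops startup instead of letting the
            // server come up without its processes.
            throw new IllegalStateException(
                    "kjar failed to compile, aborting deploy: " + compilationErrors);
        }
    }

    public static void main(String[] args) {
        requireCleanBuild(List.of()); // clean build: nothing happens
        requireCleanBuild(List.of(
                "Process Compilation error. WorkflowProcessInstanceImpl cannot be resolved to a type"));
    }
}
```

Run as-is, the second call in main throws, which is the point: the error shows up at deploy time on every server, not days later when someone tries to complete a task.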

But..a win is a win, I suppose. Global Bank deployed our fix into their DIT environment and testing resumed.

Great work, guys. I think we’re actually going to make it into this release!

It’s a shame that within the next 12 hours, I would regret saying that too.

Happy Coding,

-T.O.