I am using AWS to host some pretty heavy computer vision processing and have run into a problem that is very hard to diagnose.
The image processing infrastructure is configured as a worker instance in Elastic Beanstalk using AWS SQS. SQS will post to a node server on the instance, and that node server will spin off a java child process. For testing, I have it limited to one task at a time, and each task can only process once, so there is no concern about multiple java processes running in parallel. The actual software is deployed using Docker so that the OpenCV dependencies are easy to manage, and so that it's actually possible to spin off additional nodes in a reasonable manner.
The actual java code will load an image into memory and then perform 100,000s of sequential matrix transformations until it converges on a solution. The math is all performed using OpenCV.
The way the software is configured, all file IO occurs at the beginning; everything after that, up until the result has been calculated, happens in memory. Nothing should be written to disk.
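For a sense of the shape of that, here's a minimal sketch of the load-once-then-iterate pattern, assuming the OpenCV 2.4.x Java bindings that the classic nu.pattern artifact packages (newer builds move imread into Imgcodecs). The class, the gemm-based transform, and the convergence test are hypothetical stand-ins for the real math; only the OpenCV calls themselves are real API.

```java
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.highgui.Highgui;

public class ConvergenceSketch {

    public static void main(String[] args) {
        nu.pattern.OpenCV.loadShared(); // load the packaged native library once

        // All file IO happens here, up front.
        Mat image = Highgui.imread(args[0], Highgui.CV_LOAD_IMAGE_GRAYSCALE);
        Mat state = new Mat();
        image.convertTo(state, CvType.CV_64FC1);

        // Hypothetical transform; the real math is far more involved.
        Mat transform = Mat.eye(state.cols(), state.cols(), CvType.CV_64FC1);
        Mat empty = new Mat();

        // From here on everything stays in memory; nothing is written to disk.
        for (int i = 0; i < 500_000; i++) {
            Mat next = new Mat();
            Core.gemm(state, transform, 1.0, empty, 0.0, next);

            double delta = Core.norm(state, next, Core.NORM_L2);
            state.release(); // Mat pixel data lives in native memory, so free it explicitly
            state = next;

            if (delta < 1e-9) {
                break; // converged
            }
        }

        System.out.println("Converged, result size: " + state.size());
    }
}
```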
When running the program on a maxed-out iMac it runs in about 23 seconds, so pretty quick. When running on a more powerful Windows machine (per CPU and GPU spec), it takes about 12 minutes. When running on an m4.xlarge instance in AWS, it fails at around 50. The times themselves don't matter for the problem at hand, but they're relevant context for comparing the environments. Instead of compiling OpenCV locally, we're using nu.pattern OpenCV with Gradle. Each instance has the dependencies pre-installed.
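For reference, the nu.pattern wiring looks roughly like this on the Java side; the Gradle coordinate in the comment is an assumption about the artifact, not a copy of our build file.

```java
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;

// Sketch of the bootstrap when using the nu.pattern packaging instead of a
// locally compiled OpenCV. The Gradle side is a single dependency, something like
//     compile 'nu.pattern:opencv:2.4.9-7'   // coordinate/version is an assumption
// and the jar bundles the native library for each supported platform.
public class OpenCvBootstrap {

    public static void main(String[] args) {
        // Extracts and loads the bundled native library; must run before any Mat is created.
        nu.pattern.OpenCV.loadShared();

        Mat probe = new Mat(2, 2, CvType.CV_8UC1); // creating a Mat exercises the native lib
        System.out.println("OpenCV " + Core.VERSION + " loaded, probe " + probe.size());
        probe.release();
    }
}
```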
When the process fails, it doesn't just crash: the instance in AWS goes into completely maxed-out ReadOps for about an hour. The read queue is so backed up that the entire IOPS quota for the instance is consumed, and all other auxiliary processes I had been running (logging, ssh, etc.) immediately fail. After another hour or two, if we're lucky, the ReadOps go down and I can get in and pull the logs.
At this point it's always the same thing: Node's V8 engine ran out of memory.
Here's what I've tried so far:
1/ I've tested node's child processes, spinning off both node processes and java processes, to make sure memory allocation is consistent (I'm using spawn). I've used iotop to examine the memory allocation of each process, and the child process memory was never counted under the parent (as expected). Additionally, because the memory spaces are counted separately, changing max_old_space_size had no impact (as expected).
2/ I've run VisualVM analysis locally and verified that the memory space of the actual java process never exceeds 170 MB. Garbage collection is working as expected.
3/ I've explicitly forced garbage collection to make sure that objects were freed and the matrices in memory remained low. The error still occurred. (A sketch of this kind of check is included after this list.)
4/ I've verified the iotop operations on the node and within the docker instance. While the process is running, the only IO I see is from logging and from the ext4 journal utility. All of these processes appear normal.
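The kind of forced-GC check described in item 3 looks roughly like the sketch below (hypothetical class and matrix sizes, not the actual test code). One detail worth keeping in mind: an OpenCV Mat's pixel buffer lives in native memory, so the JVM heap that VisualVM reports, and that System.gc() manages, only covers the thin Java-side wrappers; release() is what frees the native buffer.

```java
import org.opencv.core.CvType;
import org.opencv.core.Mat;

public class GcCheck {

    public static void main(String[] args) throws InterruptedException {
        nu.pattern.OpenCV.loadShared();

        for (int i = 0; i < 1_000; i++) {
            Mat scratch = new Mat(2048, 2048, CvType.CV_64FC1); // ~32 MB of native memory
            // ... stand-in for a round of matrix work ...
            scratch.release(); // frees the native buffer deterministically
        }

        System.gc();        // collects the Java-side wrapper objects
        Thread.sleep(1000); // give the collector a moment before sampling

        Runtime rt = Runtime.getRuntime();
        System.out.printf("JVM heap in use: %d MB%n",
                (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024));
    }
}
```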
When the error happens, I lose connection to the node. I've been unable to reproduce the issue outside of AWS, and I haven't been able to identify the root failure point. Any ideas on other avenues to examine?
via Nathan Tornquist