A Case Study of Scalability Related "Out of memory" Crash in Erlang

We are building a platform for message switching, in Erlang. Everything looks OK on stability and features. It actually has run more than half year with zero down. We tested its performance on our 2-core CPU machine before, and got about 140 transactions/second, it's good enough.

Then, we got a 8-core CPU machine several weeks ago, and we did same performance testing on it, to see the scalability. Since Erlang is almost perfect on scalability, you can image the result, yes, about 700 transactions/second now, scaled almost linear. Until it crashed with "out of memory" when million hits processed.

It left a very big "erl_crash.dump" file there, I had to dig the issue. My first guess was, were some remote requests (access db, access remote web service etc) timeout but the process itself was not timeout yet, and cause more and more processes kept in VM?

A quick grep "=proc:" erl_crash.dump showed that the total number of processes was about 980, which was reasonable for our case.

So, which process ate so many memory? A quick grep "Stack+head" erl_crash.dump showed that there was indeed a process with 285082125 size of Stack+head there.

Following this clue, I caught this process:

=proc:<0.4.0>
State: Garbing
Name: error_logger
Spawned as: proc_lib:init_p/5
Last scheduled in for: io_lib_format:pad_char/2
Spawned by: <0.1.0>
Started: Sun Apr  1 01:21:50 2012
Message queue length: 2086029
Number of heap fragments: 1234053
Heap fragment data: 281266956
Link list: [<0.27.0>, <0.0.0>, {from,<0.42.0>,#Ref<0.0.0.88>}]
Reductions: 72745575
Stack+heap: 285082125
OldHeap: 47828850
Heap unused: 121777661
OldHeap unused: 47828850
Program counter: 0x0764c66c (io_lib_format:pad_char/2 + 4)
CP: 0x0764c1b4 (io_lib_format:collect_cseq/2 + 124)

This process was error_logger, which is from OTP/Erlang standard lib: error_logger, writing received messages to log file or tty. The typical usage is:

error_logger:info_msg("~p:~p " ++ Format, [?MODULE, ?LINE] ++ Data))

Which will format Data to a String according to the Format string, and write it to tty or log file.

he above case showed the message queue length of process "error_logger" had reached 1234053, and the Stack+heap was 285082125, about 272M size.

So the cause may be, that the message queue could not be processed in time, the messages were crowded in error_logger's process and finally caused "out of memory". The bottle-neck was that when error_logger tried to format the message to String, Erlang VM was weak on processing them, which seemed to need a lot of CPU cycles. In my previous blog, I talked about Erlang is bad on massive text processing. Erlang processes String/Text via List, which is obvious bottle-neck in Erlang now, with Erlang is getting much and much popular and more and more Erlang applications are written.

But, why this did not happen on our 2-core CPU machine? It's an interesting scalability related problem:

"error_logger" module will registered one and only one process to receive and handle all log messages. But Erlang VM's scheduler can not distribute ONE process to use multiple CPUs' computing ability. In our 2-core machine, the whole ability is about 140 transactions/second, the one process of "error_logger" just happened to have the power to handle corresponding log messages in time. Under 8-core CPUs machine, our platform scales to handle 700 transactions/second, but there is still only one process of "error_logger", which can not use 8-core CPUs' ability at all, and finally fail on it.

Erlang treats every process fairly (although you can change the priority manually), we can do a simple/quick evaluation:

  1. 2-Core machine, keeping hits at 140 trans/second:

The number of simultaneous processes will be about 200, each process shares the CPU cycles: 1/200 * 2 Core = 1%

  1. 8-Core machine, keeping hits at 700 trans/second:

The number of simultaneous processes will be about 980, each process shares the CPU cycles: 1/980 * 8 Core = 0.82%

So, the CPU cycles shared by error_logger process actually not increases. BTW, I think error_logger should cut its message queue when can not process them in time (disk IO may also be slower than receiving messages).

  • Posted: 2008-12-31 08:00 (Updated: 2010-01-10 14:59)
  • Author: dcaoyuan
  • Categories: Erlang

Comments

1. Geoff Cant -- 2009-01-01 08:00

Next time you have an erl_crash.dump file you might want to take a look at crashdump_viewer in the standard distribution. It parses the dump file and lets you examine all the processes (including message queue and process heap contents) from a handy web interface.

2. Caoyuan -- 2009-01-03 08:00

Geoff,

Thanks. But I got erl_crash.dump is not an Erlang crash dump