ECE: Electrical & Computer Engineering

Multi-thread Multi-Core... a Single Bug

You can run a complicated program a million times and find no evidence of the bug that will make it fail. But the bug is there, waiting. How do you find it? How do you know you found it?

According to a widely cited federal study, debugging accounts for more than 50 percent of software development costs. Software bugs have been responsible for blackouts and deaths, in addition to security vulnerabilities, product delays, financial loss, and frustration. Debugging has become a crisis, according to Chao Wang, an expert in specification and verification of complex computer systems, who joined ECE recently as an assistant professor. The crisis is compounded, he says, by the growing use of multi-core processors and multi-threaded programs.

Wang wants to help resolve the debugging crisis by building tools that automatically detect and locate bugs through automated concurrency verification. To pursue his plan, he has received a $478,000 CAREER Award, the NSF’s most prestigious grant for early-career faculty members.

Wang explains that “concurrent programs are typically written with multiple threads in parallel, and those threads need to coordinate.” It’s hard for programmers to manually reason through a program with thousands of threads.
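To give a rough sense of why manual reasoning breaks down, the short Python sketch below counts how many distinct orderings are possible when several threads each perform a fixed number of steps. It is an illustration only, not part of Wang's tools: the function name `count_interleavings` is hypothetical, and the model assumes every thread runs the same number of atomic steps and that each thread's own steps stay in program order.

```python
from math import factorial


def count_interleavings(threads: int, events_per_thread: int) -> int:
    """Count the orderings of all events that keep each thread's own
    events in program order: (k*n)! / (n!)^k for k threads of n events."""
    total_events = threads * events_per_thread
    return factorial(total_events) // factorial(events_per_thread) ** threads


if __name__ == "__main__":
    # Illustrative sizes only; real programs have far more threads and steps.
    for k, n in [(2, 2), (2, 10), (4, 10)]:
        print(f"{k} threads x {n} steps each: "
              f"{count_interleavings(k, n):,} possible interleavings")
```

Even two threads of ten steps each can be ordered in 184,756 ways, and the count grows rapidly as threads are added, so a test run that happens to exercise one ordering says little about the rest.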

[Image: artistic rendering of debugging a multi-core/multi-threaded error]

The interleaving of multi-threaded programs can cause hidden bugs that aren’t apparent even after significant testing.

“Debugging is a nightmare,” according to Wang. “You can step through a sequential program, but you can only step through one thread of a concurrent program. When you’re debugging, you have to be thinking about what all the other threads are doing. It’s not easy for humans to do.” In practice, he says, debugging often involves the programmer “simply staring at the source code, which is neither economical nor reliable.”

Wang proposes to solve three problems, beginning with efficiently generating the failure-inducing program input and thread schedule. Detecting a bug is not enough, however, Wang says; you also have to diagnose what is causing it and then repair it. He is seeking methods to accurately identify the failure’s root cause and automatically compute a repair. He also wants to determine how to fix bugs that arise from redundant or inefficient use of synchronization, which can be common in concurrent programs.

His tools will not be fully automated, but will be a powerful debugging aid for programmers. “If it can do a certain amount of work, it will save a lot of time,” says Wang. “If it’s a really complicated bug in a large code base, it will typically take weeks or months for one programmer to trace it down.” Wang would like to reduce this time to days or even hours.

“Today, every CPU is multi-core. You can’t get away from it. We are reaching the point where we can’t find many single-core processors, and if you want to leverage the computing power of the CPU, you have to write multi-threaded programs. That’s why it’s becoming an important problem.”

Every CAREER award has an educational component, and Wang is working to incorporate multi-core programming into undergraduate courses. This is urgent, he says, “because it will help avoid creating another generation of engineers whose first thoughts about concurrency are that it’s scary and always hard.”

Wang, who spent seven years working in industry, explains that although programmers in industry have more experience, on the topic of concurrent programming they aren’t much better than university students. “When they were in school, we didn’t teach this kind of thing.”

So, Wang would like to create a two-to-three-week summer course for programmers in industry. He wants to help “retrain industry professionals so they can be brought up to speed with the multi-core revolution.”

The tools Wang is designing won’t require the program to run more than a few times, and maybe only once. A programmer would run the program and log the execution traces, possibly even running the threads sequentially. “There won’t be any bug,” Wang explains, but the tool will look at the traces and reason through all the possible interleavings. “The program could predict a bug without running into it,” he says.

“By logging the traces and statically reshuffling the possible events, you may infer that there is another interleaving that triggers a bug. You can find a bug that you haven’t observed.” Then you just have to block the particular interleaving that triggers the bug.
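The idea of predicting a bug from a clean run can be illustrated with a toy Python sketch. It is not Wang's method: it assumes a hypothetical logged trace in which two threads, A and B, each increment a shared counter with a separate load and store, and it checks every reordering by brute force, where a practical tool would need something far more efficient. The names `OBSERVED`, `replay`, `interleavings`, and `predict_failures` are invented for this example.

```python
from typing import Iterator

Event = tuple[str, str]  # (thread name, operation), where operation is "load" or "store"

# Hypothetical trace logged from one run in which thread A finished before B began.
OBSERVED: list[Event] = [("A", "load"), ("A", "store"), ("B", "load"), ("B", "store")]


def replay(schedule: list[Event]) -> int:
    """Re-execute a schedule of events and return the final shared counter."""
    counter = 0
    local: dict[str, int] = {}
    for thread, op in schedule:
        if op == "load":
            local[thread] = counter        # thread reads the shared value
        else:
            counter = local[thread] + 1    # thread writes back its increment
    return counter


def interleavings(per_thread: list[list[Event]]) -> Iterator[list[Event]]:
    """Yield every merge of the per-thread traces that keeps program order."""
    if all(not events for events in per_thread):
        yield []
        return
    for i, events in enumerate(per_thread):
        if events:
            rest = per_thread[:i] + [events[1:]] + per_thread[i + 1:]
            for tail in interleavings(rest):
                yield [events[0]] + tail


def predict_failures(observed: list[Event], expected: int) -> list[list[Event]]:
    """Reshuffle the observed events and return the schedules that break the invariant."""
    per_thread: dict[str, list[Event]] = {}
    for event in observed:
        per_thread.setdefault(event[0], []).append(event)
    return [s for s in interleavings(list(per_thread.values()))
            if replay(s) != expected]


if __name__ == "__main__":
    print("observed run ends with counter =", replay(OBSERVED))
    failures = predict_failures(OBSERVED, expected=2)
    print("reorderings that lose an update:", len(failures))
    print("one failing schedule:", failures[0])
```

In this toy model the logged run looks correct, yet four of the six possible reorderings lose an update. Blocking those interleavings would correspond to making each load and store pair atomic, for example by guarding the increment with a lock.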

So that bug that didn’t show up when you ran the program a million times can be caught before it’s a problem.