Hyperledger Fabric Deployment on Alibaba Cloud Environment – SIGSEGV Problem Analysis and Solutions
According to recent feedback from the Hyperledger community, SIGSEGV-related fatal errors occur when the open-source blockchain project Hyperledger Fabric is deployed in an Alibaba Cloud environment. Having worked through this problem, I want to share my analysis process and solution for your reference.
Problem Description
During deployment of Hyperledger Fabric, the peer and orderer services fail to start, and an error is returned when cli-test.sh is executed in the CLI container. In every case, the error message is signal SIGSEGV: segmentation violation.
The following is an error log example:
2017-11-01 02:44:04.247 UTC [peer] updateTrustedRoots -> DEBU 2a0 Updating trusted root authorities for channel mychannel
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x63 pc=0x7f9d15ded259]
runtime stack:
runtime.throw(0xdc37a7, 0x2a)
/opt/go/src/runtime/panic.go:566 +0x95
runtime.sigpanic()
/opt/go/src/runtime/sigpanic_unix.go:12 +0x2cc
goroutine 64 [syscall, locked to thread]:
runtime.cgocall(0xb08d50, 0xc4203bcdf8, 0xc400000000)
/opt/go/src/runtime/cgocall.go:131 +0x110 fp=0xc4203bcdb0 sp=0xc4203bcd70
net._C2func_getaddrinfo(0x7f9d000008c0, 0x0, 0xc420323110, 0xc4201a01e8, 0x0, 0x0, 0x0)
Analysis Process
After in-depth analysis and testing, and with a hint from the Hyperledger Fabric bug at https://jira.hyperledger.org/browse/FAB-5822, I found a way to identify and solve the problem.
In the Docker Compose YAML file, I added the line GODEBUG=netdns=go to the environment variables of the peer, orderer, and CLI containers. With this setting, the pure Go resolver is used instead of the cgo resolver. The error log points to the cgo resolver as the source of the error: the crashing goroutine is inside net._C2func_getaddrinfo, which is the cgo call to getaddrinfo.
I then looked into the conditions under which Go switches between the cgo resolver and the pure Go resolver:
Note: The following Name Resolution section is quoted from the official Go documentation for the net package at https://golang.org/pkg/net/.
Name Resolution
The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.
On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.
By default, the pure Go resolver is used because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.
The resolver decision can be overridden by setting the netdns value of the GODEBUG environment variable (see package runtime) to go or cgo, as in:
export GODEBUG=netdns=go # force pure Go resolver
export GODEBUG=netdns=cgo # force cgo resolver
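For programs whose source you control, the same preference can also be expressed in code instead of through the environment. The following is a minimal sketch, assuming Go 1.9 or later (where net.Resolver has the PreferGo field); the helper name newPureGoResolver is hypothetical and is not part of Hyperledger Fabric. It does not replace the GODEBUG setting for prebuilt Fabric images, which cannot be changed this way.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver returns a resolver that asks the net package to use
// its built-in Go DNS resolver on systems where it is available.
// (Hypothetical helper name, used here only for illustration.)
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	r := newPureGoResolver()

	// Bound the lookup so a slow or unreachable DNS server cannot hang the program.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved addresses:", addrs)
}
```

Only lookups made through this resolver value are affected; code that calls net.Dial or net.LookupHost directly still follows the default decision described above.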
Comparing the Go and Cgo Resolvers
Based on this clue, I compared the underlying configuration files of the environment where deployment succeeded with those of the environment where it failed, and found the following difference:
In a container in the old environment (where the blockchain deploys successfully), /etc/resolv.conf reads:
cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0
In a container in the new environment (where the blockchain deployment fails), /etc/resolv.conf reads:
cat /etc/resolv.conf
nameserver 127.0.0.11
options timeout:2 attempts:3 rotate single-request-reopen ndots:0
Because of this difference, the pure Go resolver is used in the environment where deployment succeeds, while the cgo resolver is used in the environment where it fails: the new resolv.conf contains the option single-request-reopen, which the pure Go resolver does not support.
Note: At the time of writing, the pure Go resolver supports only the ndots, timeout, attempts, and rotate options. Any other option sets conf.unknownOpt, as the following excerpt from the Go source shows:
https://github.com/golang/go/blob/964639cc338db650ccadeafb7424bc8ebb2c0f6c/src/net/dnsconfig_unix.go
case "options": // magic options
	for _, s := range f[1:] {
		switch {
		case hasPrefix(s, "ndots:"):
			n, _, _ := dtoi(s[6:])
			if n < 0 {
				n = 0
			} else if n > 15 {
				n = 15
			}
			conf.ndots = n
		case hasPrefix(s, "timeout:"):
			n, _, _ := dtoi(s[8:])
			if n < 1 {
				n = 1
			}
			conf.timeout = time.Duration(n) * time.Second
		case hasPrefix(s, "attempts:"):
			n, _, _ := dtoi(s[9:])
			if n < 1 {
				n = 1
			}
			conf.attempts = n
		case s == "rotate":
			conf.rotate = true
		default:
			conf.unknownOpt = true
		}
	}
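To check a given resolv.conf against the option list handled in the excerpt above, a small standalone program can help. This is a minimal sketch, not the Go standard library's own logic: the function names isSupported and unsupportedOptions are hypothetical, and the program mirrors only the four cases shown in the excerpt (ndots, timeout, attempts, rotate).

```go
package main

import (
	"fmt"
	"strings"
)

// isSupported reports whether a resolv.conf option matches one of the
// cases handled explicitly in the excerpt (ndots:, timeout:, attempts:, rotate).
func isSupported(opt string) bool {
	switch {
	case strings.HasPrefix(opt, "ndots:"),
		strings.HasPrefix(opt, "timeout:"),
		strings.HasPrefix(opt, "attempts:"),
		opt == "rotate":
		return true
	}
	return false
}

// unsupportedOptions scans resolv.conf content and returns every option
// that would fall through to the default branch of the excerpt.
func unsupportedOptions(resolvConf string) []string {
	var unknown []string
	for _, line := range strings.Split(resolvConf, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "options" {
			continue // skip blank lines, comments, nameserver lines, and so on
		}
		for _, opt := range fields[1:] {
			if !isSupported(opt) {
				unknown = append(unknown, opt)
			}
		}
	}
	return unknown
}

func main() {
	// The two container files shown earlier in this article.
	oldConf := "nameserver 127.0.0.11\noptions ndots:0\n"
	newConf := "nameserver 127.0.0.11\noptions timeout:2 attempts:3 rotate single-request-reopen ndots:0\n"

	fmt.Println("old environment:", unsupportedOptions(oldConf))
	fmt.Println("new environment:", unsupportedOptions(newConf))
}
```

For the two container files shown earlier, the old environment yields no unsupported options, while the new environment yields single-request-reopen.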
Next, I looked at why the resolv.conf file differs between the old and new containers. The cause is that the configuration file of the host ECS instance has changed:
Deployment failure environment - newly created ECS:
cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 100.100.2.138
nameserver 100.100.2.136
options timeout:2 attempts:3 rotate single-request-reopen
Deployment successful environment - original ECS:
cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 100.100.2.136
nameserver 100.100.2.138
I also tried to find out why SIGSEGV errors occur when the cgo resolver is used. The article at https://tschottdorf.github.io/golang-static-linking-bug explains SIGSEGV errors caused by statically linked binaries that use cgo.
According to the description of a related Hyperledger Fabric bug, the Hyperledger Fabric build is statically linked, and this affects getaddrinfo-related calls in particular. For details, visit: https://jira.hyperledger.org/browse/FAB-6403.
Recommended Solution
We have now found the root cause and can reconstruct the chain of events: because the resolv.conf file of the new ECS host changed, inter-container domain name resolution in Hyperledger Fabric switches from the pure Go resolver to the cgo resolver. This triggers a SIGSEGV error caused by the known problem with statically linked cgo, and the deployment of Hyperledger Fabric fails.
To prevent this issue, update the Docker Compose YAML template of Hyperledger Fabric and add the environment variable GODEBUG=netdns=go to all Hyperledger Fabric nodes (such as orderer, peer, ca, and cli) to force the use of the pure Go resolver.
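To confirm that the variable actually reaches a container, the GODEBUG value can be inspected from inside it. The following is a minimal sketch of such a check; the function names netdnsSetting and forcesPureGo are hypothetical, and the handling of a repeated netdns key (the last occurrence is kept) is an assumption of this sketch rather than documented Go behavior.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// netdnsSetting returns the value of the netdns key in a GODEBUG string,
// or "" when the key is absent. GODEBUG is a comma-separated list of
// name=value pairs. If netdns appears more than once, this sketch keeps
// the last occurrence (an assumption made for illustration).
func netdnsSetting(godebug string) string {
	value := ""
	for _, kv := range strings.Split(godebug, ",") {
		if strings.HasPrefix(kv, "netdns=") {
			value = strings.TrimPrefix(kv, "netdns=")
		}
	}
	return value
}

// forcesPureGo reports whether the GODEBUG string selects the pure Go
// resolver, either as "go" or as "go+N" (resolver choice plus debug level).
func forcesPureGo(godebug string) bool {
	v := netdnsSetting(godebug)
	return v == "go" || strings.HasPrefix(v, "go+")
}

func main() {
	godebug := os.Getenv("GODEBUG")
	if forcesPureGo(godebug) {
		fmt.Println("GODEBUG forces the pure Go resolver")
	} else {
		fmt.Printf("GODEBUG=%q does not force the pure Go resolver\n", godebug)
	}
}
```

Running this in each of the orderer, peer, ca, and cli containers is one way to verify that the template change took effect on every node.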
Alibaba Cloud Container Service Blockchain Solution
Alibaba Cloud Container Service provides a basic solution for the automatic configuration and deployment of Hyperledger Fabric, which helps developers avoid complex underlying operations and focus on innovation in blockchain business applications.
For more information, see:
Alibaba Cloud Container Service Blockchain Solution
Blockchain Solution Documentation of Alibaba Cloud Container Service
Reference:
https://www.alibabacloud.com/blog/hyperledger--fabric--deployment--on--alibaba--cloud--environment--%E2%80%93--sigsegv--problem--analysis--and--solutions_593970?spm=a2c4.11999857.0.0