Missing Link

January 2nd, 2007
Update: I shouldn’t be so trigger-happy with blog entries. Summary of this post: long, boring analysis of how I thought I was tracking down a bug in ld. Followed by a mea culpa explaining that I’d gotten it all wrong. Enjoy!

Special thanks to Eric Albert for patiently showing me the error of my ways.

If you’re anything like me, the most challenging part of your development is when the process itself doesn’t seem to be working. OK, I designed the architecture of the project. I wrote the code. I expect everything to work. It’s not due until tomorrow, but what’s this? The linker’s failing?

Usually we get to take compiling and linking for granted. These tools are used by just about every shipping binary executable inside or outside of Apple, so chances are they’ve gotten all of the really critical bugs ironed out. Still, you have to wonder when you stare down a perplexing “undefined symbols” error, which you can’t for the life of you imagine being accurate:

ld: Undefined symbols:
_acl_copy_ext_native referenced from CarbonCore
expected to be defined in libSystem
_acl_copy_int_native referenced from CarbonCore
expected to be defined in libSystem
_task_name_for_pid referenced from CarbonCore
expected to be defined in libSystem

The disagreement between two of Apple’s low-level frameworks is a great tip that this isn’t really my problem. Still, I need to make sure I’m not leading the linker astray somehow. I am using SDKs (-isysroot from the command line), so maybe I’ve given the linker mixed signals about where to search for libraries. After scouring the makefiles (in this case it’s the Subversion project), looking for unwanted -L or -F flags that would point the build process back into an unwanted directory, I’m left empty-handed. This should be working!

Confirm LD’s Complaints

Just to get my head on straight, I decided to confirm that the SDK’s version of libSystem had the symbols that are allegedly “undefined”:

% cd /Developer/SDKs/MacOSX10.4u.sdk                                                                           
% nm usr/lib/libSystem.dylib | grep _acl_copy_ext_native
         U _acl_copy_ext_native
900fe38c T _acl_copy_ext_native

See the capital T? That means “implemented” to you. It actually means “Text Section Symbol,” but what’s important is it’s a symbol for code, not a symbol for an undefined import, like the U right above it. It’s OK for a single binary to have both U and T symbols, because the file will resolve against itself. If a file only has Us for a given symbol, it’s going to look elsewhere for resolution. That’s the situation that CarbonCore is in:

% cd /Developer/SDKs/MacOSX10.4u.sdk                                                                           
% cd System/Library/Frameworks/CoreServices.framework   
% cd Frameworks/CarbonCore.framework              
% nm CarbonCore | grep _acl_copy_ext_native 
         U _acl_copy_ext_native
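If you end up repeating this T-versus-U check for several symbols and binaries, it can be wrapped in a small shell function. This is only a sketch: classify_symbol is a name I made up, and it assumes the plain nm output format shown above, where the symbol type is the field just before the symbol name.

```shell
# classify_symbol SYMBOL
# Reads `nm` output on stdin and reports whether SYMBOL is defined there,
# only imported (U), or not mentioned at all.
# Hypothetical helper, not part of nm or the developer tools.
classify_symbol() {
    awk -v sym="$1" '
        $NF == sym {
            kind = $(NF - 1)          # type letter sits just before the name
            if (kind == "U") has_undef = 1
            else             has_def = 1
        }
        END {
            if (has_def)        print "defined"
            else if (has_undef) print "undefined"
            else                print "absent"
        }
    '
}
```

Run from the SDK directory, it would be used as `nm usr/lib/libSystem.dylib | classify_symbol _acl_copy_ext_native`.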

Getting bored yet? Let me cut to the chase. My SDK’s CarbonCore and libSystem are in perfect agreement. CarbonCore wants some _acl symbols, and libSystem has them. Looking at my installed system, however, I discover that these symbols are completely absent. Apple has changed the inter-framework dependencies at some point, such that my OS (10.4.8) is not compatible with my 10.4 Universal SDK.

Which implies that the linker is looking in my OS instead of the SDK, when trying to resolve these symbols. But why!? And do I really have to start debugging the linker?

Spy On Dynamic Linking

If you find yourself in a similar boat, there are some debugging aids you should know about. Set the environment variables LD_TRACE_ARCHIVES and LD_TRACE_DYLIBS to encourage ld to spit out some potentially useful information. When I did so, I was surprised to see it confess openly just how many libraries it was pulling in from my root system, instead of from the SDK. In particular:

[Logging for XBS] Used dynamic library: /usr/lib/libSystem.B.dylib

Aha! So at least I know why it’s got its head screwed on wrong. But what causes this? I’m specifying the SDK path with the “-isysroot” parameter to gcc, which in turn ends up getting passed as a “-syslibroot” parameter to ld in the link phase. What’s supposed to happen here is that the provided path gets prepended to all full path library references, and that path is given priority in the search.

So I don’t think it’s too far out of line for me to believe that the log line shown above is a sign of a bug in ld. In no case should a library be used from “/” when a perfectly good counterpart exists in the SDK tree.
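When the trace is long, it helps to filter it down to just the libraries that came from outside the SDK. The function below is a rough sketch of that idea. The name libs_outside_sdk is my own invention, and it assumes every relevant trace line contains the “Used dynamic library: PATH” text seen in the log line above.

```shell
# libs_outside_sdk SDKROOT
# Reads ld trace output on stdin and prints each dynamic library path
# that does not live under SDKROOT.
# Hypothetical helper; assumes the "Used dynamic library: PATH" format.
libs_outside_sdk() {
    sed -n 's/^.*Used dynamic library: //p' |
    while IFS= read -r path; do
        case "$path" in
            "$1"/*) ;;                      # inside the SDK, as intended
            *) printf '%s\n' "$path" ;;     # pulled in from the root system
        esac
    done
}
```

Something like `LD_TRACE_ARCHIVES=1 LD_TRACE_DYLIBS=1 make 2>&1 | libs_outside_sdk /Developer/SDKs/MacOSX10.4u.sdk` should then list only the strays. The value 1 is an assumption on my part. As far as I can tell the variables only need to be set.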

Hack The Linker

Thanks to Darwin, I find myself downloading and building my own copy of ld. At least I can take a look at what is supposed to happen, even if I can’t necessarily fix the problem. I peppered the sources with extra printf statements just so I could figure out what was happening when. It turns out that the linker’s failure to stay in the SDK tree is caused by the resolution of 2nd-level dependent libraries. That is, I link to one dylib and it has some other libraries listed in its dependencies. When ld gets to those libraries, it adds them to its list (if they’re not already there), but somehow the SDK path doesn’t get prepended, so it just uses the verbatim path. I tracked down a spot in the source code where the right thing is attempted:

if(next_root != NULL && *dylib_name == '/'){
	p->file_name = allocate(strlen(next_root) +
				  strlen(dylib_name) + 1);
	strcpy(p->file_name, next_root);
	strcat(p->file_name, dylib_name);
	p->dylib_file = p->dylib_name;
}

Emphasis mine to show that we’ve got a bug here: the code is trying to replace the existing dylib path with one that has “next_root” prepended to it. But after going to all the work of building that string, it simply ignores it, assigning the base name as the file name.

Unfortunately, even after fixing this bug and building a custom version of ld, the problem was not completely fixed. I am resisting the urge to become too much more familiar with this code, but so far I suspect that the “patched up dylib” continues to have the table of contents for the old dylib. So even though it’s now pointing into my SDK, it’s still not finding the desired symbol names in the list.

All in all I think this just reveals that “ld” is a lot less ready for prime-time in the SDK department than I’d hoped. I’ll admit that it takes some fairly unusual circumstances to detour the dylib search out of the SDK: I don’t even have the logic completely worked out yet. But I’m standing by my belief that “whenever a qualified library in the SDK path exists, it should be used in favor of a library outside the SDK.” Radar #4904317

By the way, I’ve found at least one instance via Google of somebody else having exactly this same linking error (with the _acl* symbols). This was in the WebKit project, which I’m sure also has complex dynamic linking dependencies. I notice in that email thread that the problem just mysteriously disappeared one day. What I bet happened is the person updated their Xcode or system so that the SDK and system were once again in sync. It doesn’t mean the bug isn’t still happening, it just means it’s quiet because the wrong version of the dylib happens to look a lot like the right one. I suspect this bug is happening a lot more than people realize, but in mostly innocuous ways.

Update: Mea Culpa

Well, when you write your thoughts out to the public you’ve gotta take the victories with the defeats. I’m a little embarrassed to admit that I’ve been set straight by Apple’s Eric Albert, who informed me that everything would “just work” if I removed a bunch of full-path references to /usr/lib that were in my link line. It was dumb of me to assume a bug in ld … just like my second paragraph above says, “chances are they’ve gotten all the really critical bugs ironed out.”

So if it’s not a bug, what is it? A caveat! I jumped to conclusions when I got the source code and started mucking about. My interpretation of the source code supported my idea that ld should aggressively look for libraries in the SDK, which might have been the goal at one point in the project’s history. But what’s in the source code doesn’t matter. The man page for ld clearly describes the -syslibroot option:

-syslibroot rootdir (32-bit only)
	Prepend rootdir to the standard directories
	when searching for libraries or frameworks.

Now that I’ve been set straight I know where to put the stress: on the word searching. The SDK rootdir doesn’t apply to libraries referenced by full path, some of which were erroneously present in my test case. Since these erroneous references showed up on the command line before the corresponding -l for some critical system libraries, they caused the insertion of non-SDK dylibs into ld’s search list.
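To make the distinction concrete, here is a toy model of the rule as I now understand it. It is not ld’s real logic. resolve_candidates is a made-up name, the two directories are just picked for illustration, and only the .dylib form of each library is considered.

```shell
# resolve_candidates ROOTDIR ARG
# Toy model of the -syslibroot rule, not ld's actual implementation.
# "-lNAME" is searched for, so ROOTDIR is prepended to each directory.
# A full path is not searched for, so it is passed through untouched.
resolve_candidates() {
    case "$2" in
        -l*)
            name=${2#-l}
            # Illustrative subset of the standard directories (assumption).
            for dir in /usr/lib /usr/local/lib; do
                printf '%s%s/lib%s.dylib\n' "$1" "$dir" "$name"
            done
            ;;
        /*)
            printf '%s\n' "$2"
            ;;
    esac
}
```

With the SDK as ROOTDIR, `-lSystem` yields candidates under /Developer/SDKs/MacOSX10.4u.sdk, while `/usr/lib/libSystem.dylib` comes back exactly as written, which is the trap I fell into.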

In summary: if you’re using an SDK you must rely on ld’s search mechanism to find all the Apple system libraries you depend on. If you make the mistake of specifying any of them by full path, you run the risk of leading ld astray from the SDK directory. You can’t expect ld to prefix full-path specifications with the SDK path.
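A quick way to audit a link line for this mistake is to scan its arguments for anything that names a file under /usr/lib directly. The following is a minimal sketch under that assumption. full_path_libs is a hypothetical name, and it only looks at /usr/lib, which is where my own bad references were.

```shell
# full_path_libs ARG...
# Prints every link-line argument that names a file under /usr/lib by
# full path. Such arguments bypass ld's search, and so bypass the SDK.
# Hypothetical helper for illustration only.
full_path_libs() {
    for arg in "$@"; do
        case "$arg" in
            /usr/lib/*) printf '%s\n' "$arg" ;;
        esac
    done
}
```

For example, `full_path_libs -arch ppc /usr/lib/libiconv.dylib -lz foo.o` prints only /usr/lib/libiconv.dylib. An empty result doesn’t prove the link line is clean, since dependent libraries and libtool .la files can still carry full paths of their own.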

7 Responses to “Missing Link”

  1. rentzsch Says:

    I link-blogged this, but I’ll also put it here for better locality of reference:

    I love this entry. It’s the common case that I find “here’s a bug!” + later “oh it’s not a bug, here’s the story”-style postings more valuable than straight-up “publishing-what-I-know” pieces.

    Bottom-line: I like trigger-happy blog entries penned by smart guys like you when they enumerate solution attempts, and provide an interesting back story.

  2. Daniel Jalkut Says:

    Heh, thanks for saying so, rentzsch :) I replied to somebody’s off-line comment, saying that I figure mistakes go with this profession, and are also educational in their own way.

  3. Chucky Says:

    “You can’t expect ld to prefix full-path specifications with the SDK path.”

    I wish someone had let me know this before my wedding night.

  4. Mike Zornek Says:

    I feel for ya.

    Just last night I narrowed a core data relationship bug down to it’s there when I deploy but not during regular development (even if I use the release config).

    At first I thought I was getting a different result from the xcode command line than the GUI but then I eventually found a bad model in svn. Schema 6 at some point was edited and was in the repo but not my ~/Projects/Billable folder. This by the way took about 3 hours to really understand.

    Don’t sweat it! :-)

  5. Scott Stevenson Says:

    I did a whole blog post about a Rails migration error only to have Dominik Wagner point out to me that the variable name was wrong. It was, like, ten lines of code.

    I win.

  6. Mike Laster Says:

    I ran into the *exact* same problem you did trying to build a Universal version of Subversion. What change did you make to make this problem go away?

    I followed the instructions on TN2137:

    CFLAGS="-O -g -isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch i386 -arch ppc" LDFLAGS="-arch i386 -arch ppc" ./configure
    make

    and I get a link error:

    cd subversion/libsvn_ra_dav && /bin/sh /tmp/subversion-1.4.3/libtool --tag=CC --silent --mode=link gcc -O -g -isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch i386 -arch ppc -arch i386 -arch ppc -L/tmp/subversion-1.4.3/apr-util/xml/expat/lib -rpath /usr/local/lib -o libsvn_ra_dav-1.la commit.lo fetch.lo file_revs.lo log.lo merge.lo options.lo props.lo replay.lo session.lo util.lo ../../subversion/libsvn_delta/libsvn_delta-1.la ../../subversion/libsvn_subr/libsvn_subr-1.la /tmp/subversion-1.4.3/apr-util/libaprutil-0.la /tmp/subversion-1.4.3/apr-util/xml/expat/lib/libexpat.la -liconv /tmp/subversion-1.4.3/apr/libapr-0.la -lresolv -lpthread /tmp/subversion-1.4.3/neon/src/libneon.la -framework Security -framework CoreFoundation -framework CoreServices -lz
    libtool: link: warning: `/usr/lib/gcc/powerpc-apple-darwin8/4.0.1/../../..//libiconv.la' seems to be moved
    libtool: link: warning: `/usr/lib/gcc/powerpc-apple-darwin8/4.0.1/../../..//libiconv.la' seems to be moved
    libtool: link: warning: `/usr/lib/gcc/powerpc-apple-darwin8/4.0.1/../../..//libiconv.la' seems to be moved
    ld: Undefined symbols:
    _acl_copy_ext_native referenced from CarbonCore expected to be defined in libSystem
    _acl_copy_int_native referenced from CarbonCore expected to be defined in libSystem
    _task_name_for_pid referenced from CarbonCore expected to be defined in libSystem
    /usr/bin/libtool: internal link edit command failed
    lipo: can't figure out the architecture type of: /var/tmp//ccFJj5J0.out

  7. Daniel Jalkut Says:

    Mike: I wish I had more concrete instructions for how to get around all the niggling problems, but to be honest what I ended up doing was just going through and painstakingly figuring out the failures one at a time, and sometimes rather crudely hacking around the failures.

    I believe the failure you’re seeing there is based on a problem where the System path is getting searched for libraries, instead of fetching all libraries from the SDK path. You need to go through and eradicate all references to “/usr/local/lib” for instance. Maybe taking out that -rpath argument would do the trick.

    But then you have to look in those .la arguments and make sure THEY don’t list any / relative libraries in their dependencies.

    There is probably a “right way” to fix it but that’s how I would pursue it if you just need to hack out a solution.
