Licensing and source code availability considerations

Posted: Wed Feb 03, 2010 11:53 pm
by daniel.santos
I'm a bit of a gamer and I recently purchased this nice nVidia 260 with 216 GPU cores. After registering my product, I learned from their web page about this project and thought, "hey, what better way to heat my home?!" Unfortunately, I run Linux exclusively, and I quickly discovered that the Linux client is incapable of exploiting the GPU. As a consultant / software architect / developer, I immediately sought the sources to see what it would take to add this support and was surprised to discover that the client is closed source. This posting is a plea for the development & management team to consider an alternative to this closed-source approach.

The benefits of an open source application such as this, especially one with such a noble purpose, are too great to number. On the top of the list, I will mention superb peer-review and the contributions of some of the brightest minds in the world (and I dare place myself somewhere in that vicinity, but certainly not near the top!). Were I the manager of such a project, I would have two primary concerns (in order of importance):
  1. Reliability (Correctness) of Data: the integrity of data provided by a client whose code may have been altered in a manner that introduces logical errors
  2. Server Stability: a negative effect on the stability and/or availability of the server
Server Stability
I'll address the second issue first. It is my opinion that absolutely every server application (broadly, any application that will ever listen on a port) of any importance should be written with deterrence of DoS (Denial of Service) attacks as its first priority. Some DoS attacks indeed occur by accident, or because of programming errors in a client application. Any concerns for server stability with clients built from modified code can be addressed in the same way you would make the server DoS resistant:
  • Perform a thorough audit of the server code's execution paths. Make sure that every possible anomalous condition is checked for. Make no assumptions about the integrity of anything that comes from the network! Obviously, this is especially the case with copying data into pre-allocated buffers prior to checking the size of the data (but that one is a no-brainer).
  • Add an auditing framework into the code to track connections by IP address. This part is a bit more work. You will want to construct a small database in memory and track general stats for each IP address. This table would use the IP address (or IPv6 if you've advanced that far :D ) as the primary key and contain first tracked connection time, last connection time, and a count of each transaction result type, i.e., successful uploads, successful downloads, and then each of the various error conditions. Some error conditions will occur fairly normally, like reset connections (usually the app crashing) or dropped (timed-out) connections (the computer crashing or connection going down). Other connection anomalies will be red flags of very obvious DoS attempts or just bad code. These addresses can then be banned (temporarily or permanently) either at the application level or by sending them to iptables (if you're on Linux), which will be more efficient. The Linux app fail2ban is very nice for this type of thing and would only require that you output your security failures to some log file and perform some minimal configuration to implement. You will also want to ban anybody with an excessive number of connections that resulted in some failure.
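The per-address tracking table described above could be sketched roughly like this. Everything here is hypothetical and purely illustrative -- the struct fields, the `note_result` API, and the ban threshold are my own stand-ins, and a real server would use a proper hash table keyed on the address (IPv4 or IPv6) rather than a linear scan:

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

#define MAX_TRACKED        1024
#define FAIL_BAN_THRESHOLD 20   /* illustrative, not a tuned value */

typedef enum { RES_UPLOAD_OK, RES_DOWNLOAD_OK, RES_CONN_RESET,
               RES_TIMEOUT, RES_PROTOCOL_ERROR } result_t;

typedef struct {
    uint32_t addr;              /* IPv4 address, network byte order */
    time_t   first_seen;
    time_t   last_seen;
    unsigned counts[RES_PROTOCOL_ERROR + 1];
    int      banned;
} ip_stats;

static ip_stats table[MAX_TRACKED];
static size_t   n_tracked;

/* Find or create the record for an address. */
static ip_stats *lookup(uint32_t addr)
{
    for (size_t i = 0; i < n_tracked; i++)
        if (table[i].addr == addr)
            return &table[i];
    if (n_tracked == MAX_TRACKED)
        return NULL;            /* table full; real code would evict */
    ip_stats *s = &table[n_tracked++];
    memset(s, 0, sizeof *s);
    s->addr = addr;
    s->first_seen = time(NULL);
    return s;
}

/* Record one transaction result; returns nonzero once the address is
 * banned.  A real server would also emit a log line here so fail2ban
 * (or iptables directly) could pick the address up. */
static int note_result(uint32_t addr, result_t res)
{
    ip_stats *s = lookup(addr);
    if (!s)
        return 0;
    s->last_seen = time(NULL);
    s->counts[res]++;
    unsigned failures = s->counts[RES_CONN_RESET] +
                        s->counts[RES_TIMEOUT] +
                        s->counts[RES_PROTOCOL_ERROR];
    if (failures > FAIL_BAN_THRESHOLD)
        s->banned = 1;
    return s->banned;
}
```

The server would call `note_result` at the end of every transaction and drop (or firewall) any connection whose address comes back banned.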
Reliability (Correctness) of Data
I've only recently found out about this project, but it would appear that data integrity and correctness of calculations are the major barriers to opening up the source code to allow the community to participate in development and debugging, submit patches, etc. Unfortunately, I do not have a fully open source solution to offer, but I do have a mostly open source one.

First off, understand that if any decent hacker wanted to submit invalid results, it wouldn't be terribly difficult, unless you are encrypting the data. But even that would only add to the difficulty; it wouldn't make it impossible. If an application is running on a foreign machine, it can be debugged, its memory examined & altered, its assembly code reverse engineered, etc. Given a particular network session, such a malicious hacker could intercept the public key from the network stream and/or application memory, reverse engineer the code that performs the encryption and replicate that process, feeding in invalid data (or simply altering the data prior to its encryption). Security through obscurity is only a deterrent; it can never be completely effective. Fortunately, these malicious types tend to prey on large corporations who have earned public ire, and not innocuous university programs seeking to benefit human health.

So my point here is that, as best I can tell, the client application is already vulnerable to this type of garbage, although exploiting it may not be easy. However, I would be surprised (as well as saddened and disappointed) if anybody cared enough to go through all of the trouble required to cause this type of problem. So what I propose is a solution that is as difficult, or nearly so, to compromise.
  1. There should be a development/test server. Data uploaded to this server should not be used for any science! It should exist to support community development. Additionally, this server should hand out work that has already been performed in production, so it can serve as a validation mechanism to catch changes that have broken calculations. This is the only server you can communicate with if you have compiled your own code.
  2. All communications to the production server should be encrypted. You can only use this server if you have downloaded an official (pre-compiled) binary.
  3. An abstraction should be added for server communications, allowing a selection of dynamically linked modules. Separate development and production server communications modules should be provided. Another benefit of dynamic linking is that it allows you to use otherwise incompatible licenses in the same process space (e.g., GPL & proprietary EULA).
  4. The production server communications module would have to remain closed source -- not to make compromising it impossible (I hope I've already demonstrated that such is an unattainable goal), but to deter such compromises and limit the number of people capable of them (i.e., only the highly skilled & dedicated hackers -- of whom, I hope, there are none with the desire to attack a program with such a noble cause). There are a wide variety of ugly "security through obscurity" mechanisms that can further be used to make reverse engineering this module more difficult, which I won't get into here. Importantly, the server communications module should scrutinize the process space in which it is running, including:
    • Carefully examine all dynamically linked modules
    • Attempt to determine if the process is being debugged or not
    • Perform various hash calculations on the primary executable image as well as every other DLL/shared object in the process space
    All hash codes should be transmitted to the server, as well as a report of any anomalous conditions -- this isn't just helpful for recording compromise attempts, it's also helpful to determine false positives due to normal variations in user environment. Only data produced using "officially released" binaries should be accepted by this server.
  5. Link statically to a libc that uses heap randomization (really, this is only a small help though)
  6. Finally, a development communications module should be provided for talking to the development server.
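The pluggable communications layer (item 3) could look something like the sketch below. The `comms_ops` interface, the `get_comms_ops` entry point, and the module path are all hypothetical names of my own invention, chosen only to show the shape of the idea: each module exports one well-known symbol, and the client falls back to a built-in development stub when no module can be loaded.

```c
#include <stddef.h>
#include <stdio.h>
#include <dlfcn.h>

/* Hypothetical interface every communications module would implement. */
typedef struct {
    const char *name;
    int (*connect_server)(const char *host);
    int (*upload_results)(const void *buf, size_t len);
} comms_ops;

/* Built-in development stub, used when no module can be loaded. */
static int dev_connect(const char *host)
{
    printf("dev stub: would connect to %s\n", host);
    return 0;
}

static int dev_upload(const void *buf, size_t len)
{
    (void)buf;
    printf("dev stub: would upload %zu bytes\n", len);
    return 0;
}

static comms_ops dev_ops = { "development", dev_connect, dev_upload };

/* Resolve a communications module at run time.  The closed-source
 * production module would be shipped as a separate .so and loaded
 * here; dynamic linking keeps it in its own object, outside the
 * GPL'd client sources. */
static comms_ops *load_comms(const char *path)
{
    void *h = dlopen(path, RTLD_NOW);
    if (!h)
        return &dev_ops;        /* fall back to the open dev module */
    comms_ops *(*get)(void);
    /* POSIX-sanctioned way to convert dlsym's void* to a function
     * pointer without a compiler warning. */
    *(void **)&get = dlsym(h, "get_comms_ops");
    return get ? get() : &dev_ops;
}
```

Compile with `-ldl` on older glibc. Because linkage happens at run time, the proprietary production module and the GPL'd client never end up in the same link step, which is what makes the mixed-license arrangement workable.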
Unless there are other considerations for keeping the sources closed, I believe that these remedies would allow the folding@home project to open the majority of its sources to the public, which should give development quite a nice boost. The maintainers can then pick & choose the patches that are effective, tested, verified, etc.

Re: Licensing and source code availability considerations

Posted: Thu Feb 04, 2010 12:03 am
by anandhanju
Hello daniel.santos, welcome to the forum.

Have you tried running the GPU2 clients on Linux via WINE? There are a number of topics here that provide instructions for this unsupported, yet fully functioning setup.

Re: Licensing and source code availability considerations

Posted: Thu Feb 04, 2010 12:11 am
by daniel.santos
w00t! Thanks for this info! :) I considered attempting that, but I hadn't found any info on it yet and I didn't feel brave enough to pioneer it. I hadn't thought to google "wine" on the forum yet. I'll check it out.

I've actually done a bit of Wine development, but I've never touched Direct3D, so I know pretty much nothing about that portion of the Wine API or what is even involved in getting the GPU to do your dirty work (non-graphics) in Windows.

Thanks!

Re: Licensing and source code availability considerations

Posted: Thu Feb 04, 2010 12:26 am
by 7im
We appreciate your enthusiasm and open source evangelism, but the few times that Pande Group opened up a part of fah for development, there weren't any takers. :(

However, Fah is based on Gromacs code, which IS open source. See www.gromacs.org.

See also: Folding@home: Open Source FAQ

Besides, Pande Group has also hired a professional software development group to help update the software top to bottom, so open sourcing isn't really needed, and avoids all of its pitfalls.

Re: Licensing and source code availability considerations

Posted: Thu Feb 04, 2010 1:53 am
by VijayPande
Yes, please check out our Open Source FAQ. Most of our key elements are open sourced. If you're curious about GPU code, check out the OpenMM package.

Re: Licensing and source code availability considerations

Posted: Thu Feb 04, 2010 6:50 am
by daniel.santos
Thanks for the replies and links! I'm sad to hear you had no takers in your previous attempts to utilize the open source community. I did read part of the Open Source FAQ before, but perhaps I should be more clear. What I'm specifically intending to address are issues stated in the

http://folding.stanford.edu/English/FAQ-main#ntoc12
Why don't you post the source code?
section of the Main FAQ. Without being able to compile a fully working client from source code (one that will process the same work files that a production version of the client can), you will lose the interest of most open source developers. In my experience, a bug in an area of the code or a shared library (DLL) where sources are not available can frustrate the development & debugging process. Even if it turns out the bug isn't in the closed source component, determining this for certain is more difficult when that code is not available and cannot be stepped through, traced, etc. In fact, the way most open source projects work is that there is a small group of project maintainers. They are the gate-keepers of the project. They decide which patches are used and which are discarded. Having a mostly open source client (everything but server communications) could facilitate precisely such a paradigm, where community developers experiment, test against a test server, and then submit patches.

Obviously, the need for data integrity is paramount, so I concur that no compiled binary should be allowed to upload to a production server (i.e., "real server") that has not been compiled, tested and validated by your professional development and release management teams. I may not have been clear enough about the purpose of the proposed server communication module's examination of the process environment & hash values of the executable image -- this is strictly for the purpose of validating that the executable is an officially released one and not somebody's hack job.
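To make the validation idea concrete, here is a rough sketch of the two checks the module could perform. Both helpers are my own illustrations: FNV-1a stands in for a real cryptographic hash (the actual module would use something like SHA-256), and the `/proc` paths make both checks Linux-specific.

```c
#include <stdint.h>
#include <stdio.h>

/* 64-bit FNV-1a over a buffer -- a stand-in for a real
 * cryptographic hash such as SHA-256. */
static uint64_t fnv1a(const unsigned char *p, size_t len)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}

/* Hash the running executable image (Linux: /proc/self/exe).  The
 * same loop could be run over every shared object listed in
 * /proc/self/maps, and the results reported to the server. */
static int hash_self(uint64_t *out)
{
    FILE *f = fopen("/proc/self/exe", "rb");
    if (!f)
        return -1;
    uint64_t h = 14695981039346656037ULL;
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;
        }
    fclose(f);
    *out = h;
    return 0;
}

/* Linux-specific debugger check: the TracerPid field in
 * /proc/self/status is nonzero while a tracer is attached.
 * Returns 1 if traced, 0 if not, -1 on error. */
static int being_traced(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;
    char line[256];
    int tracer = 0;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "TracerPid: %d", &tracer) == 1)
            break;
    fclose(f);
    return tracer != 0;
}
```

The server would compare the reported hashes against those of the officially released binaries and reject (or flag) anything else, which also makes it easy to spot false positives caused by legitimate environment differences.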

And while I would by no means wish to disparage any of the professionals on your development team, every day that I work on open source projects, I'm amazed by the raw talent I encounter -- and I'm not easily impressed. The vast majority of "professionals" I've worked with, regardless of degree, I've found quite unimpressive. :( Also, most (serious) open source developers I know are indeed professionals of 10 years or more.

I'm simply proposing an alternative that will allow anybody to compile a working f@h client and experiment with it, but will provide a very high level of assurance (perhaps as high as what currently exists) that no data can be uploaded that was not produced by an executable that has passed your rigorous testing, approval and release processes. In addition, I'm proposing that the value of this will likely be much greater than the costs.

Another observation I have is that the GCC compiler has gotten really good with SIMD instructions. Given the money that m$ has and their previous compilers, I would presume that the m$ compiler is as good, if not better. I admit that I am less than a mediocre x86 ASM programmer (I mostly stopped doing assembly after the 6502!), but my hand-coded SIMD was never better than gcc's. As such, I can't help but feel curious about current compilers' ability to best ASM programmers (and to you ASM programmers, please don't take offense!!!! :) ).
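As an illustration of that point about compilers and SIMD: a plain loop like the one below is exactly the kind of code GCC vectorizes on its own (compile with `-O3` and a suitable `-march`, then inspect the assembly with `-S` -- no intrinsics needed). The function itself is just a generic example, not anything from the f@h code:

```c
#include <stddef.h>

/* Element-wise multiply-accumulate over float arrays.  With
 * gcc -O3 -march=native this loop is typically compiled down to
 * packed SSE/AVX instructions without any hand-written assembly. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```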

Re: Licensing and source code availability considerations

Posted: Thu Feb 04, 2010 8:59 pm
by 7im
Spend some time on gromacs.org. Their hand coding smokes.

Plus, I think you are barking up the wrong tree here. The fah client is a basic file handler, uploading and downloading work units. The fah client doesn't do any work, so open sourcing the fah client is a waste of your time and talent, IMO. 99% of the work is done by the fahcore files, and those are 90+% based on Gromacs open source code and other open source projects.

If you want to make the project better, make the project go faster. Donate time and talent to those groups listed on the open source FAQ page. When their open source protein simulation code works faster, it translates to a speed up in the Fah project. ;)