Subject: RE: More Linux - the geek stuff
Fri Jan 29 22:25:55 1999

> > this post has gotten long enough that i won't go into the gory
> > details (those of you who bet that i *don't* know how to shut up
> > can pay up, and face the terrifying concept that this is
> > *voluntary* behavior.. ;-), but i'll be happy to provide details
> > if you want them.
>
> Hey, if you don't have any nagging tendonitis, feel free....

well, i'm thoroughly into rant mode at this point, so what the heck.

the first piece of the puzzle is the Apache SSI system, specifically
the directive:

    <!--#include virtual="..." -->

this is the preferred way to include another file, or the output of
an executable, in the current page. the path is restricted to the
local machine, but you can work around that with the proxying system,
which is the second piece.

the Apache proxy system does all sorts of stuff, but one of its
directives is called ProxyPass, which takes requests for files in a
given directory, and passes them to another server for handling. the
upshot is that you can proxy your entire cgi-bin directory off to
another machine, but as far as the SSI system is concerned, the
scripts are still local.

when a user requests a page, the SSI system will see the virtual
include directive, and do an internal request for another server
thread to run the script and return output. that thread will see
that the cgi-bin directory has been proxied. instead of trying to
acquire and run the script itself, it merely opens a connection to
the remote server and lets that do the processing.
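for anyone who'd rather read code than prose, here's a rough python sketch of the flow just described: scan a page for virtual includes, check each path against a ProxyPass-style prefix table, and hand proxied paths to a remote fetcher. this is only a model of the idea, not how Apache implements it. the names (PROXY_MAP, resolve, expand_includes), the counter.cgi script, and the stub fetchers are all made up for illustration.

```python
import re

# Hypothetical ProxyPass-style table: local URL prefix -> remote base URL.
PROXY_MAP = {"/cgi-bin/": "http://remote.foo.com/cgi-bin/"}

# Matches directives of the form <!--#include virtual="/some/path" -->
INCLUDE_RE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')


def resolve(path, proxy_map=PROXY_MAP):
    """Return the remote URL for a proxied path, or None if it is local."""
    for prefix, remote_base in proxy_map.items():
        if path.startswith(prefix):
            return remote_base + path[len(prefix):]
    return None


def expand_includes(page, fetch_remote, read_local, proxy_map=PROXY_MAP):
    """Replace each virtual include with the output of its subrequest."""
    def subrequest(match):
        path = match.group(1)
        remote_url = resolve(path, proxy_map)
        if remote_url is not None:
            # Proxied prefix: the remote server does the processing.
            return fetch_remote(remote_url)
        # Not proxied: the front-end handles it itself.
        return read_local(path)
    return INCLUDE_RE.sub(subrequest, page)


if __name__ == "__main__":
    # Stub handlers so the sketch runs without a network or a real server.
    def fake_remote(url):
        return "[output of %s]" % url

    def fake_local(path):
        return "[contents of %s]" % path

    page = '<p>hits: <!--#include virtual="/cgi-bin/counter.cgi" --></p>'
    print(expand_includes(page, fake_remote, fake_local))
```

the point of the prefix table is that the page itself never changes: the include still names a local-looking path, and only the table decides whether the work happens here or on another box.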
output from the remote server comes back through the proxy thread,
then gets passed back to the original server thread, and sent to the
user:

                     front-end server
             ---------------------------------
             [ main thread ]  [ proxy thread ]  [ remote server ]
                    |                 |                 |
    page request    |                 |                 |
    --------------->|                 |                 |
                    | SSI request     |                 |
                    |---------------->|                 |
                    |                 | proxy request   |
                    |                 |---------------->|
                    |                 |                 |---\
                    |                 |                 |   | processing
                    |                 |                 |<--/
                    |                 |          output |
                    |                 |<----------------|
                    |          output |                 |
                    |<----------------|                 |
             output |                 |                 |
    <---------------|                 |                 |

granted, the main thread does have to wait while the other threads
are doing their business, but most of that is idle time the server
can use to handle other requests. most of the lag time in the
transaction will belong to the remote server which does the actual
processing, which brings us to the third piece of the puzzle.

the way to eliminate bottlenecks at the remote server is to increase
the effective processing speed of that server. the nice thing about
web requests is that they tend to be more or less independent of each
other. that makes them good candidates for parallel processing, which
means you can get better performance by throwing hardware at the
problem.

the biggest trick is finding a way to have the front-end server query
a whole group of remote servers to find the one that has the most
free time at the moment. if you really want to, you can buy a $12K
load-balancing server from Cisco, which will make sure every back-end
server sees almost exactly as much traffic as any other. OTOH, you
can just toss a couple extra lines in your DNS files, and the network
will take care of things on its own.

what most people don't know is that it's perfectly legal to give more
than one machine the same name in a DNS file. if my zone file looks
like so:

    remote.foo.com.    IN  A  10.0.0.100
    remote.foo.com.    IN  A  10.0.0.102
    remote.foo.com.    IN  A  10.0.0.103

requests for 'remote.foo.com' will be spread more or less equally
across all three machines.
the name server hands back all three addresses, and normally rotates
their order from one lookup to the next. the client making the
request will usually set up a connection with the first address on
the list, and ignore the others. just by the simple voodoo of
rotation and cache expiry, the load will more or less balance out
across all three machines. the catch is that DNS has no idea how
busy a machine happens to be, so this isn't quite good Marxist
fashion.. from each according to its abilities, to each according to
its needs.. so much as an equal share for everybody. for independent
requests of roughly similar cost, that's usually close enough.

the same duplication trick works for the front-end servers. webpage
requests are by definition independent of each other, so you can have
multiple front-end machines proxying requests off to the same
back-end server. the minimum configuration for a fairly robust system
is to have two identical machines at the front passing requests to
two identical machines at the back. you can put a .45 slug through
the CPU of any single machine, and the system as a whole will
continue to operate.

by isolating processing to a specific group of machines, you can
build parallelized subnodes of a larger cluster. once you have those,
you can tune the performance of the subnodes to meet the demands of
the cluster as a whole. as you get into other protocols or
specialized daemons, you introduce more ways for machines to share
information, and increase the performance or feature set of the
cluster.

you can do quite a lot in a webserver farm with just the three pieces
i've already mentioned, though. and the good news is that you can do
it with a stock Apache installation and ordinary DNS.
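if you want to see how the multiple-A-record trick spreads traffic, here's a toy python simulation. it assumes the common round-robin behavior: the name server returns every address and rotates the order on each lookup, and each client connects to the first address it gets. it also assumes no caching anywhere, which real resolvers don't honor, so treat the perfectly even split as the idealized case. RoundRobinNameServer and simulate are made-up names. the addresses are the ones from the zone-file example.

```python
from collections import Counter, deque

# Addresses from the zone-file example in the post.
RECORDS = ["10.0.0.100", "10.0.0.102", "10.0.0.103"]


class RoundRobinNameServer:
    """Toy name server: returns every A record, rotating the order per query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def lookup(self):
        answer = list(self._addresses)
        # Shift left so the next query starts with the next address.
        self._addresses.rotate(-1)
        return answer


def simulate(num_requests, addresses=RECORDS):
    """Count requests per back-end when every client takes the first address."""
    server = RoundRobinNameServer(addresses)
    hits = Counter()
    for _ in range(num_requests):
        # Assumption: a fresh lookup per request, and no resolver caching.
        first = server.lookup()[0]
        hits[first] += 1
    return hits


if __name__ == "__main__":
    for address, count in sorted(simulate(9000).items()):
        print(address, count)
```

note that nothing in the loop looks at how busy a back-end is. the split comes purely from the rotation, which is why this works best when requests cost about the same.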