[no subject]
Bjorn Dittmer-Roche wrote:
>
> On Tue, 20 Apr 2004, Jeffrey B. Layton wrote:
>
>
>>Well, my response is - it depends. How long is long? How important
>>is it to you? Can you checkpoint or modify the code to checkpoint?
>>Unfortunately, there are questions you have to answer. However, let
>>me give you some things I think about.
>>
>>We run CFD codes (Computational Fluid Dynamics) to explore
>>fluid flow over and in aircraft. The runs can last up to about 48
>>hours. Our codes checkpoint themselves, so if we lose the nodes
>>(or a node since we're running MPI codes), we just back up to the
>>last checkpoint. Not a big deal. However, if we didn't checkpoint,
>>I would think about it a bit. 48 hours is a long time. If the cluster
>>dies at 47:59 I would be very upset. However, if we're running
>>on a cluster with 256 nodes with UPS and if getting rid of UPS
>>means I can get 60 more nodes, then perhaps I could just run my
>>job on more nodes and get done faster (reducing the window
>>of vulnerability if you will).
>
>
> Jeff touches on an important point here: what happens when you lose one
> node? You should think about the hardware's MTBF and think about how often
> you will lose a single node and what the consequences of that are. If
> your computations run for a week without checkpoints and you have a lot of
> nodes, you will have to worry about hardware failure as well as power. So
> good coding practice involves checkpoints.
>
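A rough sketch of the arithmetic behind this point, assuming nodes fail
independently at a constant rate (an exponential model). The per-node
MTBF of 50,000 hours is a made-up figure for illustration, not a
measurement from any cluster in this thread.

```python
import math


def cluster_mtbf_hours(nodes, node_mtbf_hours):
    """Mean time between failures of *any* node in the cluster."""
    return node_mtbf_hours / nodes


def job_survival_probability(nodes, node_mtbf_hours, job_hours):
    """Probability that no node fails while the job is running.

    Assumes independent nodes with exponentially distributed lifetimes,
    so the cluster-wide failure rate is nodes / node_mtbf_hours.
    """
    cluster_rate = nodes / node_mtbf_hours  # failures per hour, whole cluster
    return math.exp(-cluster_rate * job_hours)


if __name__ == "__main__":
    # Hypothetical figures: 256 nodes, 50,000-hour MTBF per node.
    print("cluster MTBF (hours):", cluster_mtbf_hours(256, 50_000))
    for hours in (48, 168):  # a 48-hour run and a week-long run
        p = job_survival_probability(256, 50_000, hours)
        print(f"{hours:4d} h job: P(no node failure) = {p:.2f}")
```

With those assumed numbers the 48-hour job finishes untouched about 78%
of the time and the week-long job only about 42% of the time, which is
the case for checkpointing in a nutshell.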
> At the risk of getting flamed: Have you considered alternative
> multiprocessor machines from Sun, SGI and the like? These systems have
> great reliability and let you do things like put 60 GB of RAM on one machine.
>
>
>>You also need to think about how long the UPSes will last. If you
>>need to run 48 hours and the UPS kicks in at about 24 hours, will
>>the UPS last 24 hours? If not, you will lose the job anyway (with
>>no checkpointing) unless you get some really big UPSes. So in this
>>case, UPS won't help much. However, it would help if you were
>>only a few minutes away from completing a computation and
>>just needed to finish (if it's a long run, the odds are this scenario
>>won't happen often). If you could just touch a file and have your
>>code recognize this so it could quickly checkpoint, then a UPS
>>might be worth it (some of our codes do this).
>
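The "touch a file" trick can be sketched roughly as below. This is a toy
loop, not the CFD codes Jeff describes: the function names, the pickle
format, and the trigger file name are all assumptions made for the
example.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data is on disk first
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if there is no checkpoint yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(total_steps, checkpoint_path, trigger_path, every=1000):
    """Toy solver loop that resumes from the last checkpoint, if any."""
    state = load_checkpoint(checkpoint_path) or {"step": 0, "value": 0.0}
    for _ in range(state["step"], total_steps):
        state["value"] += 1.0  # stand-in for one solver iteration
        state["step"] += 1
        if os.path.exists(trigger_path):
            # Someone (or a UPS shutdown script) touched the trigger file:
            # checkpoint right away, clear the trigger, and stop cleanly.
            save_checkpoint(state, checkpoint_path)
            os.remove(trigger_path)
            return state
        if state["step"] % every == 0:
            save_checkpoint(state, checkpoint_path)  # routine checkpoint
    save_checkpoint(state, checkpoint_path)
    return state
```

In use, a UPS "on battery" hook would only need to run something like
`touch CHECKPOINT_NOW` (a hypothetical file name) in the job directory;
rerunning the job later picks up from the saved step.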
>
> Most power problems where I used to work were very brief. I don't know
> about what things are like here in Georgia, or whether or not you have
> backup generators, but a UPS that gives you 30 seconds will get you
> through a lot of tough spots and will save you from losing your
> computations because of a ten second power outage. If you want to ride
> over major blackouts, a small UPS and a generator will be more cost
> effective than a large UPS, but again, what's the point when your node
> MTBF is on the same order as the time between power outages?
>
> bjorn
>
--
__________________________________________________________
Dow Hurst Office: 770-499-3428 *
Systems Support Specialist Fax: 770-423-6744 *
1000 Chastain Rd. Bldg. 12 *
Chemistry Department SC428 Email: dhurst at kennesaw.edu *
Kennesaw State University Dow.Hurst at mindspring.com *
Kennesaw, GA 30144 *
************************************************************
</pre>
<!--X-Body-of-Message-End-->
<!--X-MsgBody-End-->
<!--X-Follow-Ups-->
<hr>
<!--X-Follow-Ups-End-->
<!--X-References-->
<ul><li><strong>References</strong>:
<ul>
<li><strong><a name="00736" href="msg00736.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> ChrisColeman at mail.clayton.edu (Chris Coleman)</li></ul></li>
<li><strong><a name="00743" href="msg00743.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> laytonjb at charter.net (Jeffrey B. Layton)</li></ul></li>
<li><strong><a name="00786" href="msg00786.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> dhurst at kennesaw.edu (Dow Hurst)</li></ul></li>
<li><strong><a name="00807" href="msg00807.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> laytonjb at charter.net (Jeffrey B. Layton)</li></ul></li>
<li><strong><a name="00815" href="msg00815.html">[ale] Linux Cluster Server Room</a></strong>
<ul><li><em>From:</em> bjorn at sccs.swarthmore.edu (Bjorn Dittmer-Roche)</li></ul></li>
</ul></li></ul>
<!--X-References-End-->
<!--X-BotPNI-->
<!--X-BotPNI-End-->
<!--X-User-Footer-->
<!--X-User-Footer-End-->
</body>
</html>