Mission-Critical Systems Management

Innovative strategies for managing 24x7 global distributed computing environments

Yuval Lirov

Publisher: Prentice Hall, 1997, 450 pages

ISBN: 0-13-240292-0

Keywords: Information Security, System Administration

Last modified: May 24, 2021, 4:07 p.m.

Straight from Wall Street: The complete guide to managing mission-critical distributed systems.

For nearly a decade, Yuval Lirov and a team of experts at top Wall Street investment house Lehman Brothers have worked to develop production standards that combine the value and flexibility of distributed systems with the reliability of traditional mainframes. In Mission-Critical Systems Management, you'll see the powerful results they've achieved — and learn how to leverage their ideas in your own organization to:

  • Improve performance
  • Increase availability
  • Reduce costs

For the distributed systems manager, Mission-Critical Systems Management is an unparalleled idea book, bringing together state-of-the-art industry experience in systems administration, database administration, production batch cycles, and much more. You'll learn design techniques for building more effective systems management tools. The book also includes:

  • Architectures and code excerpts for system developers
  • Framework for system integrators
  • Practical advice on purchasing and support
  • New ideas for aligning IS with the requirements of users

Mission-Critical Systems Management includes contributions from experts at Sun Microsystems, Hewlett-Packard, Cray, MIT, and other leading-edge institutions, making it your single source for the world's best ideas in distributed systems management. If you're in the trenches, battling to manage distributed systems of unprecedented complexity, this book can save your career — and your sanity.

      • Introduction
        • Source of Difficulties
        • Support Strategy
          • Accountability
          • Three-Tiered Architecture
          • Integration
        • Key Results
          • Support of Growing Demand While Reducing Support Costs
          • Performance Improvements
          • Reliability Improvements
        • Conclusions
          • Distributed Approach
          • Centralized Approach
          • References
    • Part I: Cost Control
      • Chapter 1: The Spiraling Costs of Systems Management
        Frank Henderson
        • Introduction
        • Network Management vs. Systems Management (It's a Thin Line)
        • Complications of Distributed Environments
        • Networked Security
        • Other Manifestations of the Support Cost Issue
        • Options
        • Summary
          • References
      • Chapter 2: IT Service Management in the Distributed Enterprise
        Doug McBride
        • Introduction
        • Background
        • Service Management Is a Strategy
        • Service Level Agreements
        • Service Level Objectives
        • The Right Metric for the Job
        • Service Management Support Tools
        • Summary
        • Appendix — Eight Steps to a Successful Service Level Agreement
      • Chapter 3: Buy vs. Build vs. Sell in Distributed Systems Management
        Aaron Goldberg and Yuval Lirov
        • Introduction
        • Modeling the Buy vs. Build Decision
          • Baseline Model
          • Feasibility Constraints
          • Implicit Costs
        • Applying the Model to Real-World Decisions
          • Trouble Ticket System
          • Mail Alias Management
          • System Monitoring
          • Batch Scheduling
        • Conclusions
          • References
    • Part II: Automation
      • Chapter 4: Distributed Systems Monitoring
        Aaron Goldberg, Boris Grinfeld, Baruch Katz, Yuval Lirov, and Hadil Sabbagh
        • Introduction
        • Configuration Management
        • Monitors
          • Alerts
          • Host Availability Monitor
          • Log File Monitoring
          • Other Monitors
        • Summary
          • References
      • Chapter 5: Fault Management
        Martha Ben-Michael, Yuval Lirov, and John O'Donnell
        • Introduction
        • Architecture
          • Multidimensional Object Classification (MOC)
          • The Action Configuration System (ACS)
          • The Action Resolver Program
          • Filtering
        • Configuration Management
          • Configuration of Hosts
          • Configuration of Dataservers
          • Complexity
        • Conclusions
          • References
      • Chapter 6: Problem Management
        Martha Ben-Michael, David Freudenstein, Aaron Goldberg, and Yuval Lirov
        • Overview: Distributed vs. Centralized Support Models
          • The Support Productivity Metric
          • The Centralized Model
          • The Distributed Model
          • An Illustration
          • Automation
        • Problem Management Systems Design
          • Requirements
          • Escalation
          • Ticket Priorities and Corporate Culture
          • Trend Identification
          • Solutions Knowledge-Base: Historical vs. Active
          • Notifications: Tracking the Work
          • User Interface
          • Email Ticket Submission
          • Command-Line Ticket Submission Interface
          • Reports
          • Scalability
          • RTDBS User Directory Maintenance
          • Generic DBMS Maintenance: Failover and Backup of the RTDBS
        • RTDBS Implementation in Practice
          • Organizational Objectives
          • Inclusion of the Users of the System in the Project Life Cycle
          • Systems Architecture
          • RTDBS Web Interface
        • Conclusion: Business Impact and Mission-Critical Benefits to the Organization
          • Acknowledgments
          • References
      • Chapter 7: Notification Management
        Boris Grinfeld, Yuval Lirov, Andrew Sherman, and Frank Wadelton
        • Introduction
        • Systems Administration 24 Hours 7 Days a Week
          • Team-Based Systems Administration — A New Paradigm
          • Roster Management and Notification Requirements for Team-Based Support
          • Global Support and Teams That Span Oceans
        • Notification Knowledge Representation
        • Implementation
          • Knowledge Management
          • Automated Knowledge Acquisition
          • Examples
          • Browsing the Knowledge Base — Who Ya Gonna Call?
        • Summary
          • References
        • Appendix — Bing Command Summaries
          • A1. Bhelp
          • A2. Bing
          • A3. Btell
          • A4. Bfind
          • A5. Blist
          • A6. Bwho
          • A7. Bupdate
          • A8. Bnotify
          • A9. Back
          • A10. Bdep
          • A11. Bswap
          • A12. Badd
          • A13. Bdel
          • A14. Bgen
      • Chapter 8: Configuration Management for Distributed Systems Support
        Yuval Lirov and Andrew Rieger
        • Introduction
        • Design
          • Requirements
          • Data Collection
          • Data Access
        • Implementation
          • Data Collection
          • Data Retrieval
        • Conclusion
          • Current Use
          • Future Directions
          • References
    • Part III: System Administration
      • Chapter 9: Cooperative Enterprise Administration
        Mike Gionfriddo
        • Introduction
        • Cooperative Enterprise Administration Model
        • Appliances
          • Caching File Systems
          • Solstice AutoClient
          • AutoClient Operations
        • Performance
        • Management Components
          • AutoClient Server
          • Configuration Management
          • Data Management
        • Conclusions
          • References
      • Chapter 10: Surviving Large-Scale Distributed Computing Powerdowns
        Yuval Lirov and Andrew Sherman
        • Introduction
        • The Time-Critical Business Environment
        • Problems and Risks
        • Summary
        • Experience
        • Appendix A — Boot-up Check Disk Script
        • Appendix B — Powerdown Census Script
        • Appendix C — Check Disk Script Used by Census
      • Chapter 11: RAID — More Than Data Protection
        David Reilly
        • The Demand for I/O Performance
        • Basic Benefits of Disk Arrays
        • The Origins of RAID
          • RAID Level 1 — Disk Mirroring
          • RAID Level 5 — Independent Access
          • RAID 0 + 1 — Striping Plus Mirroring
          • Understanding Parity
        • Extending Data Availability and Reliability
        • A Word About RAID Performance
        • Enhancing RAID System Performance with Cache
          • New Technologies Promise Greater Performance
          • What Performance Gains Can Be Expected?
          • Figure Credits
      • Chapter 12: Networked Systems Administration
        Frank Henderson and Dave Koehler
        • The Networked System Administrator
        • Defining the Problem
          • Fault Management with SNMP
          • Real-Time Fault Management
          • Performance Management
          • Asset Management
          • Configuration Management
        • Remote Monitoring (RMON)
          • RMON Applied
          • RMON — More Management for the Dollar
        • Streamlining Administration of the Ubiquitous TCP/IP Network
          • Technology Overview
          • Building a TCP/IP Server Architecture
          • TCP/IP Communication Package
          • TCP/IP Addressing and Naming Standards
          • TCP/IP Administration Architecture
          • DHCP/BootP and TFTP Server Architecture
          • DHCP/BootP Directory Structure
          • DNS Server Architecture
          • DNS Database Directory Structure
          • References
      • Chapter 13: Performance Management
        Vincent Kasten
        • Introduction
        • Terminology
          • An Example System
          • Applications, Platforms, and Systems
          • Transactions and Performance Measures
          • Application Performance
          • Platform Performance Measures
        • Components of Performance
          • CPU
          • Disk
          • Memory
          • Network
        • Proactive Performance Management
          • Sources of Performance Problems
          • Characterizing the Application and the Workload
          • Routine Measurement and Evaluation
        • Summary
          • References
      • Chapter 14: Introduction to Network Security
        Theodore Ts'o
        • Introduction to Network Security
          • Motivation
          • Threat Analysis
          • Approaches to Network Security
          • Introduction to Authentication
        • Token Authentication Systems
          • Introduction
          • Secure-ID
          • DES-Based Token Systems
          • S/Key
          • Summary
        • An Introduction to Cryptography
          • Introduction
          • Encryption Systems
          • Cryptographic Checksums
          • Random Number Generators
          • Summary
        • Kerberos
          • Introduction
          • How Kerberos Works
          • Kerberos V5
          • Where to Get Kerberos?
          • Summary
        • Public-Key Systems
          • Introduction
          • The RSA Public-Key Cryptosystem
          • Public-Key Certificates
          • Certificate Hierarchies
          • The PGP Model
          • Summary
        • Kerberos and Public-Key Systems
          • Introduction
          • Kerberos
          • Public-Key Systems
          • Red Herrings
          • Integrating Public Key into Kerberos
          • Summary
        • Conclusion
    • Part IV: Database Administration
      • Chapter 15: Quantifying Database Support Quality
        Aaron Goldberg, Yuval Lirov, and Maxwell Riggsbee
        • Introduction
        • Defining Database Availability
          • System vs. Application Availability
          • Dataserver Outage vs. Data Outage
          • Outage Classes
        • Measuring Database Availability
          • Dataserver Availability
          • Dataserver Outage vs. Data Outage
          • Outage Classes
        • Critique
          • References
      • Chapter 16: Multiple Performance Gains at Low Buffered I/O Cost
        Aaron Goldberg, Joyce Lee, Yuval Lirov, and Maxwell Riggsbee
        • Introduction
        • The Original Problem
        • Solution Space
        • Evaluation
        • Further Applications
        • Conclusions
          • References
      • Chapter 17: Physical Database Performance Tuning Methodology
        Mike Simone
        • Introduction
        • Memory
          • Buffer Pool and Sort Memory Tuning Methodology
          • Buffer Pool Size
          • Sort Heap and Sort Heap Threshold
        • Physical I/O
          • AIX File System Placement
          • Table and Index Maintenance
        • Utilities
        • Conclusion
          • Notes
        • Appendix — DB2/AIX Reorgchk Report
      • Chapter 18: Data Replication
        Marie Buretta
        • An Overview of Data Replication
        • Definition of Replication
          • Types of Replication Solutions
          • Important Issues to Consider When Using Asynchronous Replication
        • Why Should Replication Be Implemented as an Enterprise Service?
      • Chapter 19: DataCompass — A Three-Tiered Database Support Model
        Martha Ben-Michael, Rob Chase, Yuval Lirov, and Maxwell Riggsbee
        • Introduction
        • Two-Tiered Client-Server Model — The Problem
        • The Three-Tiered Model — A Solution
        • Availability, Scalability, and Performance
          • Availability
          • Scalability
          • Performance
        • Architecture
        • Application of DataCompass in Practice
          • DataCompass Use in Application Outage Troubleshooting
        • DataCompass as a DBA Utility
        • Conclusions
    • Part V: Parallel Batch Administration
      • Chapter 20: Distributed Batch Administration
        G. Larry Chen and Yuval Lirov
        • Introduction
        • Business Requirements for Nightly Production
        • Approach — Centralized Parallel Batch Administration
          • Centralized vs. Distributed Administration
          • Prototype Architecture
          • Performance Metrics
        • Key Issues in Distributed Batch Administration
          • Reliability of Schedulers
          • Component Redundancy
          • Instance Redundancy
          • Dynamic Scheduling and Parallel Computing
          • Production Monitoring
        • Parallel Batch Administration
          • Batch Job Assimilation
          • Production Monitoring
        • Summary
          • Acknowledgment
          • References
        • Appendix — Job Submission Form
      • Chapter 21: Distributed Workload Management with NQE
        Neil Bannister, Daryl Coulthart, Dan Ferber, and Laraine MacKenzie
        • Challenges for Workload Distribution
        • Meeting the Challenges with the NQE Workload Management Model
          • Client Interfaces
          • Execution Server Functions
          • Job Execution
        • The Model at Work: Case Studies
          • Case 1: Global File Distributions
          • Case 2: Fair Sharing
          • Case 3: Nationwide Sharing
          • Case 4: Heterogeneous Load Balancing
          • Case 5: Load Balancing for Terminal Sessions
          • Case 6: IBM MVS Support
          • Case 7: Batch Scheduling
        • The Future of Workload Management with NQE
        • Conclusions
      • Chapter 22: Mission-Critical Client-Server Application Testing
        Jayaram Bhat
        • Introduction
          • Risks to Client-Server Systems
          • Managing Risk in Client-Server Systems
        • Client-Server Testing Requirements
          1. Client GUI Testing
          2. Middleware Inspection and Analysis
          3. Client Load Testing
          4. Server Load Testing
          5. Client-Server Load Testing
          6. Cross-Platform Support
          7. Managing the Test Process
        • Testing Throughout the Application Development Life Cycle
          • Load Testing
          • Benchmarking
          • Development and Modeling
          • System Testing
          • Performance Testing
          • Acceptance Testing
          • System Enhancement and Upgrade Testing
          • Production Testing
          • Capacity Testing
        • Summary
          • Notes
          • References
        • Appendix — An Example Integrated Testing Suite
      • Chapter 23: Batch Testing of Multiuser Interactive Applications
        Aaron Goldberg and Yuval Lirov
        • Introduction
        • Remote Batch Test of Interactive Applications
        • PASTEL
        • Characterizing the Test with Performance Signatures
        • PASTEL and Daily Integrity Checks
        • Experience and Future Work
          • References
      • Chapter 24: Enterprise-Wide Backups
        Donald Gertler and Yuval Lirov
        • Introduction
        • Centralized Backups
          • Overview
          • Process Flow of Lehman Brothers' Networker Backups
          • Non-UNIX Partitions
          • Capacity and Performance Concerns
          • Problems and Issues
          • Market Options
        • Incremental Local Backup — rdump(8)
          • Overall Architecture of a run_dump Domain
          • Problems with Naive Dumps
          • Why Incremental Dumps?
          • The run_dump Script
          • Configurable Dump Levels and Overrides
          • Fault Handling
        • Comparison Matrix
        • Conclusion
          • Acknowledgments
          • References
        • Appendix A — /etc/dump.conf Configuration File
        • Appendix B — fsfail Script
        • Appendix C — /var/adm/dump.dir/dump.YYMMDD
      • Chapter 25: Content-Independent Software Distribution
        Donald Gertler, Tim Li, and Yuval Lirov
        • Introduction
        • Architecture
        • Distribution Parameters
          • Versioning
          • Dynamic Source Hosts
          • Dynamic Target Machine Lists
          • Dynamic Component Lists
          • Scheduling
          • Specifying Distribution Parameters Through the Centralized GUI
        • Action Plan and Predeployment
          • Action Plan Generation
          • Dependency Verification
          • Predeployment Preparation
          • Locking
        • Deployment
          • Staged Deployments
          • Data Transfer Mechanism
          • Advantages of Standardized Batch Execution Mode
        • Postdeployment
          • Postdeployment Package Check
          • Postdeployment Massaging
          • Back-out Capabilities
        • Ongoing Processing and Maintenance
          • Periodic Verification
          • Package Removal
          • Deployment Tool Availability and OS Compatibility
          • Disaster Recovery/Avoidance
        • Conclusions

Reviews

Mission-Critical Systems Management

Reviewed by Roland Buresund

Bad ** (2 out of 10)

Last modified: May 21, 2007, 3:12 a.m.

The so-called "author" is in reality just an editor/co-author of a number of vendor-written articles. Qualified trash, but what can you expect when it is based on the experience of an investment bank?
