Must-Read for IT Infrastructure Engineers: A Complete Guide to Achieving Non-Functional Requirements

IT Infrastructure Engineer Job Overview

Infrastructure engineers in the IT industry are responsible for designing, developing, and operating system infrastructures. Their main role is to meet the non-functional requirements of IT systems.

Non-functional requirements refer to requirements other than the system’s business functions and are generally classified as follows:

  • Availability
  • Performance and Scalability
  • Operability and Maintainability
  • Migratability
  • Security
  • System environment and ecology

Infrastructure engineers utilize technical expertise and ingenuity to fulfill non-functional requirements in their work. When providing their own services, they can clearly define the non-functional requirements themselves and proceed with system development based on them. However, the situation is different in system development projects commissioned by clients. It is not uncommon for clients to be unable to clearly define non-functional requirements. Therefore, infrastructure engineers are also expected to take on a consultant-like role. They need the ability to collaborate with clients to organize and review non-functional requirements and support the requirements definition process.

The scope of responsibilities for infrastructure engineers varies depending on the project. Generally, they handle physical equipment such as servers, networks, and storage, as well as middleware like virtualization software, databases, and application servers. In recent years, more companies have been adopting cloud services, making the design and operation of cloud environments an important part of their work. Infrastructure engineers cover a wide range of technical areas and need to continuously learn the latest knowledge, including container technologies and Infrastructure as Code (IaC). For those who enjoy technology, this field can be highly rewarding.

What Are Non-Functional Requirements?

In Japan, it is common to refer to the “Non-Functional Requirement Grades” published by the Information-Technology Promotion Agency (IPA), an independent administrative institution, for non-functional requirements. Even if a project does not have its own defined non-functional requirements, utilizing these grades can help ensure a certain level of quality. Since they are useful for clarifying non-functional requirements, active use is recommended.
Reference link:
https://www.ipa.go.jp/archive/digital/iot-en-ci/jyouryuu/hikinou/ps6vr700000077he-att/000028844.zip

Purpose of Introducing Non-Functional Requirement Grades

It is extremely important for vendors building IT systems and their clients who commission the work to define and agree upon non-functional requirements in advance. This is because if any discrepancies in understanding are discovered after the system has reached a certain stage of completion, rework may be required, which can significantly impact costs and delivery schedules. Furthermore, insufficient hardware specifications can lead to serious issues where the final system fails to meet the requirements. However, clients who are not specialists in IT system development often find it difficult to determine which non-functional requirements to specify and at what levels. In such cases, the “Non-Functional Requirement Grades” provided by the IPA prove to be effective. By utilizing this document, vendors and clients can reach specific and clear agreements on non-functional requirements, thereby preventing troubles caused by misunderstandings later on.

Overview of Non-Functional Requirement Grades

The Non-Functional Requirement Grades consist of six categories: availability, performance and scalability, operability and maintainability, migratability, security, and system environment and ecology. In recent years, as more systems are built on cloud environments, there may be fewer considerations related to the “system environment and ecology” category compared to on-premises environments. However, it remains important to carefully consider the other categories and reach agreement among stakeholders.

Major categories of non-functional requirements and their descriptions:

  • Availability: Requirements to ensure continuous availability of system services. Defines system uptime targets and operational goals in the event of failures or disasters.
  • Performance and Scalability: Requirements regarding system performance and future expansion. Defines performance targets such as throughput and response time, as well as requirements considering future workload growth.
  • Operability and Maintainability: Requirements related to system operations and maintenance services. Defines requirements for operational tasks such as backups, system monitoring methods, and various maintenance activities.
  • Migratability: Requirements related to the migration of existing system assets. Defines migration methods, scope, and schedules.
  • Security: Requirements to ensure the security of information systems. Defines compliance rules and measures to address security risks.
  • System environment and ecology: Requirements related to system installation environments and ecological considerations. Defines various constraints related to installation environments, such as earthquake resistance and seismic isolation.

Explanation of Each Item in the Non-Functional Requirement Grades

In this section, we will explain the medium and detailed items for each major category of the Non-Functional Requirement Grades described in the previous section. Additionally, we will share practical know-how gained from real experience for your reference.

Please note that the know-how introduced in this article assumes the following conditions:

  • Systems that are relatively large-scale and mission-critical
  • A perspective of building systems from the standpoint of an IT vendor

Availability

Major category: Availability
Medium categories and their key items:

Continuity:
  – Operation schedule
  – System operating/stop time
  – Business areas requiring continuity
  – Recovery level in case of failure
  – System availability rate

Fault tolerance:
  – Redundancy of servers, terminals, network devices, networks, storage, etc.
  – Data backup methods and recovery scope

Disaster countermeasures:
  – Requirements for business continuity during large-scale disasters
  – Requirements for data storage during large-scale disasters

Recoverability:
  – Recovery work content during large-scale disasters
  – Availability verification scope

<Continuity>

  • When there are multiple processing methods such as online processing and batch processing, it is advisable to establish a business continuity policy for each.
  • From a business continuity perspective, clearly identify any SPOFs (Single Points of Failure). Servers and switches are commonly made redundant to ensure business continuity, but other components are often left as they are. For example, many systems use a single, non-redundant storage enclosure.
  • For devices using an Active-Standby configuration, agree in advance on the time required for switching. In an Active-Standby setup, temporary business errors may occur during failover. To avoid disputes with customers when troubles arise after operation begins, it is important to clearly define the expected failover time and verify through testing that the requirements are met.
  • Regarding availability, carefully reach an agreement with the customer. Mission-critical systems may require a high availability of 99.999% (approximately 5 minutes of downtime per year), but achieving this in open systems is extremely challenging. Also, it is currently difficult for cloud-based systems to achieve 99.999% availability. Therefore, it is important to thoroughly review the SLA (Service Level Agreement) of each cloud service and carefully consider and set the overall availability.
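
The availability figures above can be checked with simple arithmetic. The following is a minimal Python sketch, not a definitive method: it converts an availability target into annual downtime and combines the availability of several components. The function names and the SLA figures in the example are hypothetical, and the formulas assume that failures are independent, which real systems rarely satisfy exactly.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability: float) -> float:
    """Return the downtime per year (in minutes) allowed by an availability ratio."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain of components that must all be up at the same time."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent redundant units when one unit is enough."""
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # 99.999% corresponds to a little over 5 minutes of downtime per year.
    print(f"99.999% -> {annual_downtime_minutes(0.99999):.2f} min/year")

    # Hypothetical SLA values for three cloud services the system depends on.
    overall = serial_availability(0.9999, 0.9995, 0.9999)
    print(f"three services in series -> {overall:.5f}")

    # Hypothetical pair of redundant units, each 99.9% available.
    print(f"two redundant units -> {redundant_availability(0.999, 2):.6f}")
```

Because the availabilities of services that must all be up are multiplied together, the overall figure is always lower than the weakest individual SLA. This is one reason to review each cloud service's SLA before agreeing on an overall target.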

<Fault Tolerance>

  • For physical equipment, confirm with the hardware vendor whether redundant configurations exist at the component level.

Performance and Scalability

Major category: Performance and Scalability
Medium categories and their key items:

Business Workload:
  – Requirements related to workloads that affect performance and scalability
  – Requirements regarding anticipated workload growth from system launch through the end of its lifecycle
  – Requirements for number of users, concurrent access, data volume, number of online requests, and batch processing volume

Performance Targets:
  – Target values for online and batch response time (time from receiving a request to returning a response)
  – Target values for online and batch throughput (amount of processing handled per unit of time, such as number of transactions or data volume)

Resource Scalability:
  – Requirements for CPU and memory utilization
  – Requirements for disk and network utilization
  – Requirements for methods of increasing server processing capacity (scale-up/scale-out)

Performance Quality Assurance:
  – Whether network bandwidth is guaranteed
  – Whether hardware resources are dedicated
  – Frequency and scope of performance testing
  – Measures for handling spike loads (e.g., limiting concurrent transactions, displaying a “Sorry” page, or temporary service suspension)

<Business Workload>

  • In general web systems, it is common to limit the number of concurrent accesses to protect the system from heavy traffic. It is desirable to implement traffic control at multiple points, such as load balancers and web servers, so it is important to plan in advance which devices will perform what type of control.
  • For the number of online requests, it is important to clearly define the unit of time. For example, designing a system to handle “1,000 requests per second” to accommodate burst traffic may result in over-specification. Setting the requirement to around “10,000 requests per minute” and ensuring that even if burst traffic occurs, the system meets the requirement when viewed on a per-minute basis is generally a safer specification from the IT vendor’s perspective. Be sure to discuss this thoroughly with the customer and obtain their agreement.
  • It is necessary to clarify in advance the expected increase in workload. Especially in on-premises environments, keep in mind that resource expansion may have upper limits.
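
The effect of the unit of time on a request-rate requirement can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: the trace is made up, the function name `peak_per_window` is hypothetical, and it counts requests in fixed (non-overlapping) windows rather than sliding ones.

```python
from collections import Counter


def peak_per_window(timestamps: list[float], window_seconds: int) -> int:
    """Return the largest number of requests that fall into any fixed window."""
    buckets = Counter(int(t // window_seconds) for t in timestamps)
    return max(buckets.values(), default=0)


def build_sample_trace() -> list[float]:
    """Hypothetical one-minute trace: a burst followed by steady traffic."""
    # 1,000 requests arrive within the first second (the burst).
    trace = [i / 1000 for i in range(1000)]
    # 100 requests per second arrive during the remaining 59 seconds.
    for second in range(1, 60):
        trace.extend(second + j / 100 for j in range(100))
    return trace


if __name__ == "__main__":
    trace = build_sample_trace()
    print("peak requests in one second:", peak_per_window(trace, 1))
    print("peak requests in one minute:", peak_per_window(trace, 60))
```

In this made-up trace the busiest second contains 1,000 requests, while the whole minute contains 6,900. Measured per second, the system would have to be sized for 1,000 requests per second; measured per minute, the same traffic stays within a requirement of 10,000 requests per minute.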

<Resource Scalability>

  • When evaluating CPU and memory usage, it is important to decide on the measurement interval in advance. Since CPU usage can fluctuate significantly in short bursts, setting the measurement interval too short may lead to overestimation of required resources.
  • When increasing server processing capacity, it is advisable to determine beforehand, for each server type, whether to scale up (enhancing the performance of existing servers) or scale out (adding more servers). In particular, for physical servers, components such as CPUs, memory, and disks may be discontinued or expansion slots may be fully occupied, which can limit scalability. Careful attention is required.
  • Additionally, identify which devices need dedicated hardware resources. For example, it is common to build the core database servers as dedicated machines to minimize the impact from other servers.
  • Design the system so that it does not completely shut down even under spike loads. Introducing Distributed Denial of Service (DDoS) mitigation products is also an effective measure.
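
One common way to keep a system responsive under spike loads is to cap the number of requests processed at once and reject the excess with a “Sorry” response. The following is a minimal sketch of that idea, assuming a threaded Python application. The class name `ConcurrencyLimiter`, the 503 status code, and the message text are illustrative choices, not part of any specific product.

```python
import threading
from typing import Callable


class ConcurrencyLimiter:
    """Admit at most `max_concurrent` requests at once and reject the rest."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def handle(self, work: Callable[[], str]) -> tuple[int, str]:
        # Non-blocking acquire: when every slot is busy, shed the request
        # immediately instead of queueing it, so a spike cannot pile up work.
        if not self._slots.acquire(blocking=False):
            return 503, "Sorry, the service is busy. Please try again shortly."
        try:
            return 200, work()
        finally:
            # Always free the slot, even if the work raised an exception.
            self._slots.release()


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(max_concurrent=2)
    status, body = limiter.handle(lambda: "order accepted")
    print(status, body)
```

In practice the same kind of limit is usually also applied at the load balancer or web server, so the application-level cap is only the last line of defense.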

Operability and Maintainability

Major category: Operability and Maintainability
Medium categories and their key items:

Normal Operations:
  – Operating hours (normal days and specific days such as holidays, weekends, and month-end/start)
  – Backups (scope, frequency, retention period, method)
  – Operational monitoring (monitored items, monitoring intervals)
  – Scope of time synchronization

Maintenance Operations:
  – Whether planned downtime is required
  – Reduction of operational workload (scope of automated maintenance tasks)
  – Patch application policy (frequency of patch information provision, target systems, timing of application, and whether verification is performed)
  – Scope of active maintenance
  – Frequency of periodic maintenance
  – Frequency of preventive maintenance (e.g., preemptive replacement when failure signs are detected)

Operations During Failures:
  – Recovery procedures (manual, automated tools, or recovery via business applications)
  – Presence and scope of alternative operations
  – Response to system anomaly detection (available response times, on-site response times)
  – Level of spare part availability

Operational Environment:
  – Availability of development and testing environments
  – Level of operational manual provision
  – Availability of remote operations
  – Presence of external system connections

Support Structure:
  – Maintenance contracts (hardware/software)
  – Lifecycle period (time until next system renewal)
  – Division of maintenance tasks (roles between IT vendors and customers)
  – Division of emergency response roles (roles between IT vendors and customers)
  – Support personnel allocation
  – Whether regular reporting meetings are required and their frequency

<Normal Operations>

  • Clearly define the number of backup generations and the retention period in advance. These factors significantly affect disk capacity estimates, so depending on your requirements, you may need to purchase additional disk capacity later. Also, if you have multiple backup targets, we recommend organizing and clearly documenting the requirements for each target.
  • Operational monitoring is extremely important for stable system operation. Carefully consider what to monitor and at what interval. For example, if CPU usage is collected every five minutes, it may be difficult to identify the cause of a problem when it occurs. For important items, consider a shorter collection interval without impacting system resources. Also, keep in mind that for items that cannot be collected using standard commands, you may need to create your own scripts and incorporate them into the monitoring targets.
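
To show how backup generations and retention drive disk capacity, here is a rough estimate written as a Python sketch. It is a simplified model under stated assumptions: each retained generation is one full backup plus a fixed number of incrementals, every incremental is a fixed fraction of the full size, and compression and deduplication are ignored. The class name, field names, and sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackupTarget:
    name: str
    full_size_gb: float               # size of one full backup
    generations: int                  # number of full-backup generations retained
    incrementals_per_generation: int  # incrementals taken between full backups
    change_rate: float                # fraction of data changed per incremental

    def required_gb(self) -> float:
        """Estimate the disk space needed to keep all retained generations."""
        incremental_gb = self.full_size_gb * self.change_rate
        per_generation = (
            self.full_size_gb + self.incrementals_per_generation * incremental_gb
        )
        return self.generations * per_generation


if __name__ == "__main__":
    # Hypothetical targets; document the real requirements per target.
    targets = [
        BackupTarget("database", 500, 4, 6, 0.05),
        BackupTarget("file server", 200, 2, 6, 0.01),
    ]
    for target in targets:
        print(f"{target.name}: {target.required_gb():.0f} GB")
    print(f"total: {sum(t.required_gb() for t in targets):.0f} GB")
```

Changing `generations` from 4 to 8 in this model doubles the space needed for that target, which is why the retention policy should be agreed before disks are sized.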

<Maintenance Operations>

  • Clearly define patching policies and schedules in advance. Patching requires considerable effort and time, including impact assessments and thorough testing such as long-running (aging) tests. Except for urgent security vulnerabilities or critical bugs that could cause service outages, patches should be applied on a planned, long‑term schedule.
  • Consider patching not only the OS and middleware but also device firmware. It is important to set up a process for receiving regular updates from device vendors so that you can respond early.
  • Identify in advance which components cannot undergo active (non‑disruptive) maintenance. Components that support rolling updates can usually be maintained without downtime; however, active‑standby configurations may experience brief interruptions during failover. Similarly, certain changes, such as database table structure modifications, may require a full system shutdown. Because active maintenance can be challenging in these scenarios, it is essential to communicate these limitations to customers beforehand to avoid future issues.

<Operational Environment>

  • Work with your customer to establish a development environment that closely mirrors the production environment. Many production issues arise from insufficient testing in less representative development environments. Clearly explain the need for such an environment and obtain the customer’s agreement early in the project to prevent issues later.

<Support Structure>

  • Carefully review both hardware and software maintenance contracts. Standard support periods often expire before the end of a system’s lifecycle. Confirm in advance whether extended support is available and clarify the scope and service levels provided after the initial support period ends.

Migratability

Major category: Migratability
Medium categories and their key items:

Migration Timing:
  – Migration schedule (migration period, system downtime, presence or absence of parallel operation)

Migration Method:
  – System deployment method (whether phased system deployment is used)

Migration Targets (Equipment):
  – Migration scope (hardware, OS, middleware, etc.)

Migration Targets (Data):
  – Data volume and format to be migrated
  – Migration media (quantity and type)
  – Conversion requirements (need for data conversion and tools)

Migration Plan:
  – Work allocation (division of tasks between IT vendor and customer)
  – Rehearsals (scope, frequency, presence or absence of external integration testing)
  – Troubleshooting (migration contingency plans and organizational structure)

<Migration Timing>

  • Careful consideration is required when planning the migration schedule. While customers naturally want to minimize service downtime as much as possible, it is important for the IT vendor to propose a realistic and feasible schedule that minimizes risk.

Security

Major category: Security
Medium categories and their key items:

Prerequisites and Constraints:
  – Compliance with information security requirements (laws, internal regulations, etc.)

Security Risk Analysis:
  – Security risk analysis (identifying threats, scope of impact analysis)

Security Assessment:
  – Security assessments (network assessment, web assessment, database assessment, etc.)

Security Risk Management:
  – Review of security risks (frequency and scope)
  – Review of security risk countermeasures (addressing threats identified after operations begin)
  – Security patching (scope, policy, timing)

Access Control and Usage Restrictions:
  – Authentication functions (presence of authentication, number of authentication attempts)
  – Usage restrictions (system-based measures, physical measures)
  – Management methods (rule setting and enforcement)

Data Confidentiality:
  – Data encryption (requirement for encrypting transmitted data, requirement for encrypting stored data)

Unauthorized Access Tracking and Monitoring:
  – Unauthorized access monitoring (log collection, log retention period, monitoring targets, etc.)
  – Data validation (presence of digital signatures, validation frequency)

Network Measures:
  – Network control (whether communication control is implemented)
  – Intrusion detection (scope of detection)
  – Mitigation of denial-of-service attacks (requirement for network congestion countermeasures)

Malware Countermeasures:
  – Malware protection (scope of protection, requirement for real-time scanning, full-scan frequency)

Web Security Measures:
  – Web implementation measures (requirement for secure coding, requirement for WAF implementation)

Security Incident Response and Recovery:
  – Requirements for incident response structures

<Network Measures>

  • In recent years, damage caused by DDoS attacks has become increasingly common, making it advisable to implement countermeasures. Consider using external services such as Akamai or Cloudflare.

System environment and ecology

Major category: System environment and ecology
Medium categories and their key items:

System Constraints / Prerequisites:
  – Constraints during construction (legal regulations, ordinances, internal company policies, etc.)
  – Constraints during operation (legal regulations, ordinances, internal company policies, etc.)

System Characteristics:
  – Number of users
  – Number of clients
  – Number of sites
  – Geographic distribution
  – Specific product requirements (presence or absence of customer-specified products)
  – System usage scope
  – Multilingual support

Applicable Standards:
  – Product safety standards (requirement for certification)
  – Environmental protection standards (requirement for certification)
  – Electromagnetic interference standards (requirement for certification)

Equipment Installation Environment Conditions:
  – Seismic resistance / vibration isolation (maximum seismic intensity)
  – Space (installation space limitations, room for expansion)
  – Weight (floor load capacity, installation measures)
  – Electrical equipment compatibility (power supply compatibility, power capacity constraints, power outage countermeasures, etc.)
  – Temperature (range)
  – Humidity (range)
  – Air conditioning performance

Environmental Management:
  – Measures to reduce environmental impact (use of green procurement laws, equipment lifecycle considerations)
  – Energy consumption efficiency
  – CO2 emissions
  – Low noise levels

In Conclusion

The role of an infrastructure engineer is to meet non-functional requirements. In this article, we introduced non-functional requirement grades and their importance in system design.
Because there are so many factors to consider, it is not always possible to reach agreement with the customer on every item at the outset of a project. However, it is essential to identify and thoroughly discuss the key items in advance. By effectively leveraging non-functional requirement grades to define requirements clearly, you can prevent issues caused by misaligned expectations later in the process. Thank you for taking the time to read this article.

Reference

Strengthening the Upstream Process of System Construction (Non-Functional Requirement Grades), introduction page | Archive | IPA (Information-technology Promotion Agency, Japan)
