The infrastructure knowledge framework a new infrastructure engineer found necessary after three months

【table of contents】

 ① The role of infrastructure engineers

 ② System diagram of monitored objects (7 items)

 ③ Criteria for determining the urgency of monitored objects (4 levels)

[Contents]

Nice to meet you

My name is Mizusawa Yoshihiro and this is my first post on the Beyond Blog

I was welcomed as a member of Beyond in August 2020

I am currently based in the Shikoku office in Miyoshi City, Tokushima Prefecture

No, I'm very sorry..

I work at the Persian Cat Office in the same place as Inoue-san, who is a very active Persian cat. I'm always grateful for all your help, including with Nasu

Although it is a brief summary, I have listed my career history in the "Profile" section at the end, so please take a look if you are interested

Recently, for some reason, I have started to seriously think, "I want to become moss," "I want to become moss." Just to be clear, "It's not seaweed," "It's moss," "Moss!"

This is important. It won't be on the test, but it's extremely important

Seriously, I have been pondering deeply, like moss, lately. I have come to think this way because of my "great encounter" with Yamaguchi-san and with Susan Cain's books, after years of living as an "introvert" in an increasingly uncertain world. But I'll talk about that another time

So, this is the main topic of this blog

The theme of my first post is "Understanding systematic knowledge of IT infrastructure." The topics are based on what I learned from the book "Introduction to IT Infrastructure Monitoring [Practice] for Software Engineers (by Yuichiro Saito)," and focus on the points I felt were particularly necessary during my three months of work as a "newbie infrastructure engineer" who changed careers with no prior experience

Therefore, please forgive me for focusing more on the "operation and maintenance" side than the design and construction side at this time. In the future, I would like to write blog articles about the "design and construction" side as well


I chose this theme because, although I touch the "inside" of infrastructure work every day, I have often felt that failing to grasp the overall picture of the "outside" keeps me from understanding the events occurring on the inside

To use a cooking analogy, daily home cooking gradually taught me to make specific dishes such as stir-fried vegetables and hamburger steaks, yet I kept struggling because I had not grasped the outer framework of cooking as a whole: "procuring ingredients, cutting, heating, seasoning, and plating."

The catalyst for switching over was my encounter with "Logical Cooking." Specifically, I learned from Chef Mizushima Hiroshi's book that "low-temperature, slow-heating = low to low-medium heat that rises 10°C per minute (vegetables: max 130°C, meat and fish: max 180°C)," "salting = 0.8% of the ingredient's weight," and "cutting with a knife at a 30-degree angle." This helped me to understand highly reproducible elements, which significantly reduced the burden of cooking. And from next month, I'm thinking of trying these methods not just at home, but using kitchen appliances like the Hot Cook and Healsio

I've gone a little off topic, but as an infrastructure engineer I have begun to feel that it would be a huge waste not to go through the same process now, so I decided to look back at the "big picture" of infrastructure knowledge

I wrote this article in the hope that it will be of particular help to the following people:

 ① Those aiming to become infrastructure engineers

 ② New infrastructure engineers who joined a company a few months ago

 ③ Myself one year from now

Well, let's get started

The role of infrastructure engineers

 ① Infrastructure design

  → We meticulously design the elements needed to build an infrastructure environment that is less likely to fail,

   taking into account "security, redundancy, availability, maintainability, and reliability,"

   so that it delivers the performance our customers expect

 ② Infrastructure construction

  →Build an infrastructure environment that is less prone to failures based on design documents

 ③ Understanding failure alerts

  → It is extremely important to detect and understand failure alerts as quickly as possible

   Our company uses monitoring tools such as "Zabbix" and "Datadog"

   We have also put in place a system that provides an initial response within five minutes

   This is something I have felt keenly in my daily work since new employee training

   Specifically, a delay of even a few seconds can have a significant impact on customer service

   Imagine accessing a website you want to see and having to wait a few seconds

   I think you can see the seriousness of the impact

   And when investigating the cause, being even a few seconds late can make it take

   several times longer to pinpoint the problem in the "access log" or other sources

 ④ Recovery response

  → Our company has put in place a system that aims to "recover as quickly as possible when a failure occurs"

   Specifically, "infrastructure engineers" are on-site 24 hours a day, 365 days a year,

   and we strive for continuous improvement every day

   In addition, we use "Chatwork" to ensure smooth communication with stakeholders

 

System diagram of monitored objects (7 items)

 ① Alive monitoring (connectivity to the server)

  →Monitor whether Ping communication is possible at the ICMP level with the target host
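
As a rough illustration of the ICMP check described above, here is a minimal sketch in Python. It assumes a Linux-style `ping` command (the `-c` and `-W` options); the function name `is_alive` and the injectable `runner` parameter are my own inventions for this example, not part of Zabbix or Datadog.

```python
import subprocess


def is_alive(host, timeout_s=2, runner=subprocess.run):
    """Return True if `host` answers a single ICMP echo request."""
    # -c 1: send one packet; -W: seconds to wait for a reply (Linux iputils ping).
    cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = runner(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # ping exits with status 0 only when a reply was received.
    return result.returncode == 0
```

Calling `is_alive("192.0.2.10")` (a documentation-only address used here as a placeholder) would run one ping and report whether a reply came back. A real monitoring tool would repeat this on a schedule and alert after several consecutive failures.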

 ② Port alive monitoring (status of services on the server)

  → Monitor whether communication is possible on a specific port (80, 443, etc.)

  → If a port does not respond, check the state of the corresponding process with the top or ps command
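
The port check above can be sketched as a plain TCP connection attempt. This is only an assumption about how such a check might look; the names `is_port_open` and `check_ports` and the `connect` parameter are hypothetical and exist only for this example.

```python
import socket


def is_port_open(host, port, timeout_s=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port can be established."""
    try:
        conn = connect((host, port), timeout=timeout_s)
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
    conn.close()
    return True


def check_ports(host, ports=(80, 443), **kwargs):
    """Map each port number to True (open) or False (closed)."""
    return {port: is_port_open(host, port, **kwargs) for port in ports}
```

A successful connection only shows that something is listening on the port. Whether the service behind it answers correctly is what external monitoring is for.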

 ③ Process alive monitoring

  → Monitor whether a specific process is running and how many instances are running (Apache, MySQL, Zabbix-agent, etc.)

  (1) MySQL

     ・The number of connections is capped at the value set in my.cnf

     ・Replication delay is monitored only on slave servers

      This is because the metric measures the delay

      in transferring data from the master server to the slave server

     ・For the InnoDB Buffer Pool, check it in advance with mysqltuner.jp,

      and consider tuning or increasing the physical memory of your instance if necessary

  →Identify the process where the load is increasing
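
To make the idea of counting running processes concrete, here is a small sketch that parses the output of `ps -e -o comm=`. The process names used in the usage note (`httpd`, `mysqld`) are assumptions, since actual names differ between distributions, and the function names are hypothetical.

```python
import subprocess


def list_process_names():
    """Return the command name of every process, one per line (procps `ps`)."""
    return subprocess.run(
        ["ps", "-e", "-o", "comm="], capture_output=True, text=True, check=True
    ).stdout


def count_processes(name, ps_output):
    """Count the lines of `ps -e -o comm=` output that equal `name`."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def check_process(name, ps_output, min_count=1):
    """Return (ok, count); ok is False when fewer than min_count are running."""
    count = count_processes(name, ps_output)
    return count >= min_count, count
```

For example, `check_process("httpd", list_process_names())` would report whether at least one Apache worker is present, and `min_count` can be raised when a fixed number of workers is expected.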

 ④ Performance monitoring

  →Monitor the "performance indicators" of the object being monitored

  (1) CPU Load Average

    ⇒ The average number of processes running or waiting to run on the CPU

     It is sampled every 5 seconds and expressed as an average over 1, 5 and 15 minutes

     Specifically, you can check this with the "top -c" command

  (2) CPU idle

    ⇒ The percentage of time the CPU is idle, that is, how much CPU capacity is still available

  (3) CPU iowait

    ⇒ The percentage of time the CPU spends waiting for input/output processing to complete
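
The three indicators above can be read directly on Linux. The following is a minimal sketch, assuming the standard field order of the aggregate `cpu` line in `/proc/stat`; all function names are hypothetical, and dividing the load average by the core count is my own choice to make the value comparable between servers.

```python
import os


def load_per_core(loadavg=None, cores=None):
    """Return the 1, 5 and 15 minute load averages divided by the core count."""
    loadavg = loadavg if loadavg is not None else os.getloadavg()
    cores = cores or os.cpu_count() or 1
    return tuple(round(value / cores, 2) for value in loadavg)


def parse_cpu_line(stat_text):
    """Return the counters of the aggregate "cpu" line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def idle_and_iowait_percent(before, after):
    """Compare two /proc/stat samples and return (idle %, iowait %)."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already counted inside user/nice, so only the first
    # eight fields are summed.
    delta = [y - x for x, y in zip(first, second)][:8]
    total = sum(delta)
    if total == 0:
        return 0.0, 0.0
    return round(100 * delta[3] / total, 1), round(100 * delta[4] / total, 1)
```

In practice the two samples would be the contents of `/proc/stat` read a few seconds apart. A per-core load above 1.0 roughly means more work is queued than the CPUs can run at once.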

 ⑤ Resource monitoring

  →Monitor the "usage" of the monitored object

  (1) Memory usage

    ⇒ When total memory runs short, the "OOM Killer" is activated in Linux and

     tries to free up memory by killing processes that are using a lot of memory

     As a result, depending on which process is forcibly terminated, a system failure may occur

     Specifically, you can check memory usage with the "free" command

  (2) Swap usage rate

    ⇒ "Swap" is the mechanism of temporarily moving the contents of memory to the hard disk when main memory is insufficient

    ⇒ Thrashing is the state in which swap out and swap in occur repeatedly

    (a) "Swap out" means moving unused memory pages from main memory

          to the swap area when main memory is insufficient, freeing up space in main memory

    (b) "Swap in" means reloading a memory page from the swap area into main memory

     Note that reading from the swap area is tens of thousands of times slower

     than reading from main memory, and can significantly degrade the performance of the entire server

  (3) Disk usage rate

    ⇒If disk usage is increasing, delete less important past logs and binary files

     Also, setting up log rotation will help prevent bloat
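
As a sketch of how the three usage rates above might be calculated, the following reads values in the format of `/proc/meminfo` and uses `shutil.disk_usage` for the disk. The function names are hypothetical, and treating `MemAvailable` as the free headroom is an assumption; monitoring tools may define memory usage differently.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return values


def memory_usage_percent(info):
    """Memory in use, treating MemAvailable as the remaining headroom."""
    return round(100 * (1 - info["MemAvailable"] / info["MemTotal"]), 1)


def swap_usage_percent(info):
    """Swap in use; 0.0 when no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return round(100 * (total - info["SwapFree"]) / total, 1)


def disk_usage_percent(path="/"):
    """Share of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return round(100 * usage.used / usage.total, 1)
```

On a Linux host, `parse_meminfo(open("/proc/meminfo").read())` supplies the dictionary for the first two functions. The thresholds that turn these percentages into alerts are a matter of policy and are not shown here.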

 ⑥ External monitoring

  → Monitor whether specific web pages containing your content can be accessed

   The top page is monitored via the load balancer

  → Monitor whether the communication path to the web server is interrupted, based on the ping response time

  → If packet loss occurs, the bandwidth limit may have been reached,

   or there may be a malfunction in the network equipment or an error in the network device driver
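
The web page check above could look something like the following sketch, which requests a URL and records the status code and response time. The function name `check_url` and the `opener` parameter are hypothetical, and treating any 2xx or 3xx status as healthy is my own assumption.

```python
import time
import urllib.error
import urllib.request


def check_url(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Return (ok, status, elapsed_seconds) for a GET request to `url`."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        status = exc.code
    except OSError:
        # DNS failure, refused connection, timeout, TLS error and so on.
        status = None
    elapsed = time.monotonic() - started
    ok = status is not None and 200 <= status < 400
    return ok, status, elapsed
```

Running this from outside the server's own network, against the URL that passes through the load balancer, is what makes it "external": it exercises the same path a customer's browser would take.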

 ⑦ Information acquisition

  → Obtain information about the monitored object

  (1) Notify when the value of "System Hostname" is changed

  (2) Notify when the result value of the "uname command" is changed

  (3) Notify when the value of the "/etc/passwd file checksum" is changed
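
The three change notifications above amount to taking a snapshot and comparing it with the previous one. Here is a minimal sketch for a Unix-like host; the function names are hypothetical, and SHA-256 is my own choice of checksum, not necessarily what a given monitoring tool uses.

```python
import hashlib
import os
import socket


def take_snapshot(passwd_path="/etc/passwd"):
    """Collect the three values that are watched for changes."""
    with open(passwd_path, "rb") as handle:
        checksum = hashlib.sha256(handle.read()).hexdigest()
    return {
        "hostname": socket.gethostname(),
        # os.uname() is Unix only: sysname, nodename, release, version, machine.
        "uname": " ".join(os.uname()),
        "passwd_checksum": checksum,
    }


def changed_keys(previous, current):
    """Return the sorted names of the values that differ between two snapshots."""
    return sorted(key for key in current if previous.get(key) != current[key])
```

A scheduler would store the previous snapshot, call `take_snapshot()` again later, and send a notification for each name that `changed_keys` returns.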

Criteria for determining the urgency of monitored objects (4 levels)

 ① A rank

  ⇒This is something that needs to be addressed as a top priority due to the possibility of service being suspended

  (a) Alert in alive monitoring

  (b) Alert in port alive monitoring

  (c) Alert in external monitoring (a specific customer website)

  (d) Alert in process alive monitoring (Apache)

  (e) Alert in process alive monitoring (MySQL)

 ② B rank

  ⇒If left unattended, there is a risk of service being suspended, so this is something that needs to be addressed as a priority

  (a) CPU Load Average alert

  (b) CPU idle alert

  (c) CPU iowait alert

  (d) Memory usage alert

  (e) Alert in process alive monitoring (Zabbix-agent)

 ③ C rank

  ⇒This does not immediately affect services, but requires a response as quickly as possible

  (a) Swap usage rate alert

  (b) Disk usage alert

 ④ I rank

  ⇒ Informational notification only. The cause of the change is investigated

  (a) Notification of a change to the "System Hostname"

  (b) Notification of a change to the result of the "uname command"

  (c) Notification of a change to the "/etc/passwd file checksum"
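
The four ranks above can be written down as a simple lookup table. The sketch below encodes exactly the assignments listed in this section; the key names (such as `process:apache`) and the function `triage` are hypothetical labels chosen for this example.

```python
# Lower number = more urgent.
RANK_ORDER = {"A": 0, "B": 1, "C": 2, "I": 3}

ALERT_RANK = {
    "alive": "A",
    "port": "A",
    "external": "A",
    "process:apache": "A",
    "process:mysql": "A",
    "cpu_load_average": "B",
    "cpu_idle": "B",
    "cpu_iowait": "B",
    "memory_usage": "B",
    "process:zabbix-agent": "B",
    "swap_usage": "C",
    "disk_usage": "C",
    "hostname_changed": "I",
    "uname_changed": "I",
    "passwd_checksum_changed": "I",
}


def triage(alerts):
    """Sort alert names so that the most urgent rank comes first."""
    return sorted(alerts, key=lambda alert: RANK_ORDER[ALERT_RANK[alert]])
```

When several alerts fire at once, `triage` gives the order in which to look at them. Alerts of the same rank keep the order in which they arrived, because Python's sort is stable.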

Summary

What do you think?
Were they helpful in understanding the overall picture of infrastructure operations?

Since I first started working as an infrastructure engineer in August 2020, I have come to realize that this is a meaningful job that allows me to support our customers' important "information assets."

However, at the same time, I feel a sense of tension and crisis every day, and I keep hitting walls. To be honest, I don't know how far I've climbed in three months, but I think the only thing I can do is continue to struggle every day and increase the things I can do now, one thing at a time

Fortunately, I am blessed with people who are willing to listen to my questions after I have formulated and thought about my own hypotheses, and I am very grateful for that

In the process, I would like to continue to work hard as an infrastructure engineer to be of service to all stakeholders

I apologize for the length of this post, but let me finally talk about my experiences as an introvert. Sorry..

The reason is that I'm beginning to feel that the profession of "infrastructure engineer" might be one of the environments in which "introverted types" can fit in

I am currently in a phase where I can naturally accept and develop my introverted personality. However, until I was 29, I didn't feel that way at all, and I continued to deny myself

I was able to accept these things because I was blessed with the appropriate opportunities, but for me it was an extremely tough journey..

That's why I would personally be very happy if I could be of help to introverts through the process of working as an infrastructure engineer

I still have a long way to go, but I hope you will watch over me with a long-term view, like moss. Thank you very much for your continued support

About the author

Yoshihiro Mizusawa

【WORK EXPERIENCE】
・13 years of experience in quality control, education system improvement, and business operations at a logistics company for precision machinery.
・Lived in Argentina for one year.
・2 years of developing, operating, and monitoring cloud server systems as an infrastructure engineer in the tech field at Beyond

【QUALIFICATIONS】
・LPIC Level3
・AWS SAA