The infrastructure knowledge framework a new infrastructure engineer found necessary after three months
【Table of contents】
① Role of an infrastructure engineer
② System diagram of monitoring targets (7)
③ Judgment criteria for each level of urgency for monitoring targets (4)
[Contents]
Nice to meet you.
My name is Yoshihiro Mizusawa, and this is my first post on Beyond Blog.
I have been a member of Beyond since August 2020.
I currently work at the Shikoku office in Miyoshi City, Tokushima Prefecture. . .
No, I'm very sorry. . .
I work at the "Persian Cat Office," the same location as Mr. Inoue, who is very active as a Persian cat. . . Thank you for always taking care of me, including the eggplants.
It is only a brief summary, but I have written about my career so far in the "Profile" section at the end, so if you are interested, please take a look.
Well, lately, for some reason, I've been seriously thinking, "I want to be moss." . . Just to be sure: not seaweed. Moss. "Moss!"
This is important. It won't appear on the test, but it's extremely important.
Seriously. I have arrived at this view after living for a long time as an introvert in an increasingly uncertain world. Lately, on reflection, I feel that it has something to do with "encounters." . . But that's a story for another time. . .
Now, on to the main topic of this post.
For my first post, I chose the theme of "understanding IT infrastructure knowledge systematically." It is based on what I learned from the book "Introduction to IT infrastructure monitoring [Practical] for software engineers" by Yuichiro Saito. For each topic, it focuses on the points that I, a "new infrastructure engineer" who changed careers with no prior experience, felt were especially necessary over several months of practical work.
As a result, I apologize that the content leans more toward operations and maintenance than toward design and construction. In the future, I would like to write a blog article about the "design and construction" part as well.
I chose this theme because, while I have many opportunities to deal with the "inside" of infrastructure operations day to day, there have been many times when I felt I was holding myself back by not fully grasping the overall picture, the "outside."
To compare it to cooking: by cooking for ourselves every day, we gradually become able to make individual dishes such as stir-fried vegetables or hamburger steak. But I used to struggle because I had never fully grasped the outer framework of cooking itself, such as "purchasing, cutting, heating, seasoning, and presentation." I thought my situation as an engineer was similar.
What changed that for me was my encounter with "logical cooking." Specifically, Chef Hiroshi Mizushima's book let me grasp highly reproducible rules such as "Low and slow heating = low to medium heat that rises 10℃ in 1 minute (vegetables: max 130℃, meat and fish: max 180℃)" and "Saltiness = 0 of the weight of the material." Starting next month, I am thinking of using cooking appliances such as "Hot Cook" and "Healsio" instead of cooking by myself.
This is a little off-topic, but I began to feel that it would be a huge waste not to go through the same process as an infrastructure engineer at this point. So I decided to look back over the "overall picture" of infrastructure knowledge.
In particular, I wrote this in hopes of being of help to the following people.
① Those aiming to become infrastructure engineers
② New infrastructure engineers who have been with their company for a few months
③ Myself, one year from now
Well then, let's go.
The role of an infrastructure engineer
① Infrastructure design
→Carefully design the infrastructure environment so that it achieves the performance the customer expects
and, at the same time, includes the elements that make failures less likely,
such as "security, redundancy, availability, maintainability, and reliability."
② Infrastructure construction
→Build an infrastructure environment that is less prone to failure based on the design documents.
③ Understanding failure alerts
→It is extremely important to detect and understand failure alerts as soon as possible.
At our company, we use monitoring tools such as "Zabbix" and "Datadog."
We have also established a system to provide an initial response within 5 minutes.
→Why do we care so much about speed? Since finishing new employee training, I have been acutely aware of the reason every day in my work.
Specifically, a delay of just a few seconds can have a significant impact on the customer's service.
If you imagine having to wait a few seconds after accessing a website you want to view,
I think you can feel how serious the impact is.
Likewise, when investigating the cause, being just a few seconds late in checking the location of the failure, such as the access log,
can make it take several times longer to identify.
④ Recovery response
→At our company, we have a system in place to achieve recovery as quickly as possible in the event of a failure.
Specifically, "infrastructure engineers" are available 24 hours a day, 365 days a year,
and we make continuous improvements to this system every day.
In addition, we use "Chatwork" to facilitate smooth communication with stakeholders.
System diagram of monitoring targets (7)
① Alive monitoring (reachability of the server)
→Monitor whether the target host responds to ping at the ICMP level.
② Port alive monitoring (operation status of the server)
→Monitor whether there is communication to a specific port (80, 443, etc.).
→ Check with top or ps command.
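As a rough illustration of the port alive monitoring in ②, here is a minimal sketch in Python that simply tries to open a TCP connection. The function name `port_is_open` and the default timeout are my own assumptions for illustration; they are not how Zabbix or Datadog implement the check.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `port_is_open("127.0.0.1", 443)` would report whether something is accepting connections on port 443 of the local machine. Note that this only confirms the port accepts connections, not that the service behind it is returning correct content.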
③ Alive monitoring of processes
→Monitor the startup status and number of startups of specific processes (Apache, MySQL, Zabbix-agent, etc.).
(1) MySQL
- For the number of connections, the upper limit is the value set in my.cnf.
- Replication delay applies only to slave servers.
  This is because the item monitors the delay in transferring data
  from the master server to the slave server.
- For the InnoDB Buffer Pool, run MySQLTuner (mysqltuner.pl) in advance
  and consider tuning, or increasing the physical memory of the instance, as needed.
→Identify the process where the load is increasing.
④ Performance monitoring
→Monitor the "performance indicators" of the monitored target.
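(1) CPU Load Average
⇒The average number of processes waiting to run on the CPU (processes currently running are counted too).
It is calculated every 5 seconds and is expressed as averages over 1 minute, 5 minutes, and 15 minutes.
Specifically, you can check it with the "top -c" command.
(2) CPU idle
⇒The percentage of time the CPU is idle, that is, how much CPU capacity is still available.
(3) CPU iowait
⇒The percentage of time the CPU spends waiting for input and output processing.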
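To make "CPU idle" and "CPU iowait" in ④ a little more concrete, here is a minimal sketch that derives both percentages from two samples of the aggregate "cpu" line in Linux's /proc/stat. The function names are hypothetical, and the sample values used with them would be whatever you read from your own server.

```python
from typing import Dict

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse a line such as 'cpu  100 0 50 800 50 0 0 0 0 0' into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Pad with zeros in case the kernel reports fewer counters.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, float]:
    """Return the share of CPU time per state between two samples, in percent."""
    deltas = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}
```

In practice you would read the first line of /proc/stat twice, a few seconds apart, and pass both parsed samples to `cpu_percentages`. For the load average itself, Python's `os.getloadavg()` returns the same 1, 5, and 15 minute values that `top` shows.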
⑤ Resource monitoring
→Monitor the "usage status" of the monitored target.
(1) Memory usage rate
⇒When total memory runs short, the "OOM Killer" is activated in Linux
and attempts to free up memory by killing processes that are using a lot of memory.
As a result, depending on which process is forcibly terminated, a system failure may occur.
Specifically, you can check memory usage with the "free" command.
(2) Swap usage rate
⇒"Swap" is a function that temporarily moves the contents of memory to the hard disk when main memory is insufficient.
⇒Thrashing (repeated swap out/in)
(a) "Swap out" means moving unused memory pages from main memory
to the swap area to free up main memory space when main memory is insufficient.
(b) "Swap in" means reading memory pages in the swap area back into main memory.
Note that reading from swap can be tens of thousands of times slower than reading from main memory
and can significantly degrade overall server performance.
(3) Disk usage rate
⇒If the disk usage rate has increased, delete unimportant past logs and binary files.
Also, configuring log rotation will help prevent logs from becoming too large.
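As one way to picture the memory and swap usage rates in ⑤, the sketch below computes both from the text of Linux's /proc/meminfo. The function names are hypothetical, and treating "MemTotal minus MemAvailable" as used memory is my own assumption (MemAvailable is only present on reasonably recent kernels); monitoring tools may define the rate differently.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def usage_rates(info: Dict[str, int]) -> Dict[str, float]:
    """Return memory and swap usage in percent (assumed definitions, see above)."""
    mem_total = info["MemTotal"]
    mem_used = mem_total - info["MemAvailable"]
    swap_total = info.get("SwapTotal", 0)
    swap_used = swap_total - info.get("SwapFree", 0)
    return {
        "memory": 100.0 * mem_used / mem_total,
        # A server without swap configured reports SwapTotal as 0.
        "swap": 100.0 * swap_used / swap_total if swap_total else 0.0,
    }
```

On a Linux server, `usage_rates(parse_meminfo(open("/proc/meminfo").read()))` would give the current values. For the disk usage rate, the standard library's `shutil.disk_usage(path)` returns the total, used, and free bytes of the file system containing `path`.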
⑥ External (outline) monitoring
→Monitor whether specific web pages containing the content can be accessed.
The top page is monitored via a load balancer.
→Monitor whether the communication route to the web server is interrupted, based on the ping response time.
→If packet loss occurs, either the upper limit of communication speed has been reached,
or there may be a problem with the network equipment or the network device driver.
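The web page check in ⑥ can be pictured roughly as follows. This is only a minimal sketch using Python's standard library; the function name `check_url`, the timeout, and the rule that "2xx or 3xx means OK" are my own assumptions, not the behavior of any particular monitoring tool.

```python
import time
import urllib.error
import urllib.request


def check_url(url: str, timeout: float = 5.0):
    """Send one GET request and return (ok, status, elapsed_seconds).

    Hypothetical helper for illustration only.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        status = exc.code
    except (urllib.error.URLError, OSError):
        # No HTTP response at all: DNS failure, refused connection, timeout.
        status = None
    elapsed = time.monotonic() - start
    ok = status is not None and 200 <= status < 400
    return ok, status, elapsed
```

Because the request goes through the same route as a real visitor (including the load balancer), a failure here means the page is actually unreachable from outside, which is why this kind of check is treated so seriously.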
⑦ Information acquisition
→Acquire information about the monitoring target.
(1) Notify when the "system host name" changes.
(2) Notify when the result of the "uname" command changes.
(3) Notify when the "checksum of /etc/passwd file" changes.
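The checksum item in ⑦ boils down to "hash the file, then compare with the last known value." Here is a minimal sketch of that idea. The function names and the choice of SHA-256 are my own assumptions; the monitoring tool may use a different hash.

```python
import hashlib
from pathlib import Path


def file_checksum(path) -> str:
    """Return the SHA-256 hex digest of the file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def has_changed(path, baseline: str) -> bool:
    """Return True if the file no longer matches the recorded baseline digest."""
    return file_checksum(path) != baseline
```

For example, you could record `file_checksum("/etc/passwd")` once, then call `has_changed("/etc/passwd", baseline)` on each check and notify when it returns True. A change here often means a user was added or removed, which is why it is worth knowing about even though it is not a failure in itself.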
Judgment criteria for each level of urgency for monitoring targets (4)
① A rank
⇒Items that require top-priority response because a service outage is possible.
(a) Alive monitoring alert
(b) Port alive monitoring alert
(c) External (outline) monitoring alert (customer's specific website)
(d) Process alive monitoring alert (Apache)
(e) Process alive monitoring alert (MySQL)
② B rank
⇒Items that require priority response because a service outage is possible if left unaddressed.
(a) CPU Load Average alert
(b) CPU idle alert
(c) CPU iowait alert
(d) Memory usage rate alert
(e) Process alive monitoring alert (Zabbix-agent)
③ C rank
⇒Items that do not have an immediate impact on services but require a response as quickly as possible.
(a) Swap usage rate alert
(b) Disk usage rate alert
④ I rank
⇒Information notification only. Investigate the cause of the occurrence.
(a) Change notification for "system host name"
(b) Change notification for the result of the "uname" command
(c) Change notification for "checksum of /etc/passwd file"
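The four ranks above can be written down as a simple lookup table, which makes it easy to sort incoming alerts by urgency. This is only a sketch: the alert names used as dictionary keys are hypothetical labels I made up for illustration, not identifiers from Zabbix or Datadog.

```python
# Hypothetical alert names mapped to the ranks described above.
RANKS = {
    "alive": "A",
    "port": "A",
    "external": "A",
    "process:apache": "A",
    "process:mysql": "A",
    "cpu_load_average": "B",
    "cpu_idle": "B",
    "cpu_iowait": "B",
    "memory_usage": "B",
    "process:zabbix-agent": "B",
    "swap_usage": "C",
    "disk_usage": "C",
    "hostname_changed": "I",
    "uname_changed": "I",
    "passwd_checksum_changed": "I",
}

# Most urgent first.
ORDER = "ABCI"


def triage(alerts):
    """Return the alerts sorted from most to least urgent.

    The sort is stable, so alerts of the same rank keep their input order.
    An alert name missing from RANKS raises KeyError.
    """
    return sorted(alerts, key=lambda name: ORDER.index(RANKS[name]))
```

For example, if "disk_usage" and "port" fire at the same time, `triage` puts "port" first, matching the rule that A rank items get the top-priority response.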
Summary
What did you think?
I hope this helped you understand the overall picture of infrastructure operations.
Ever since I first started working as an infrastructure engineer in August 2020, I have realized that this work is a meaningful job that allows me to support our customers' important "information assets."
On the other hand, I continue to run into walls every day, with a sense of tension and crisis. To be honest, I don't know how far I've climbed in three months, but I think I have no choice but to keep struggling every day and expand what I can do, one thing at a time.
Fortunately, I am blessed with people who are willing to help me when I ask questions after forming my own hypotheses, and I am very grateful for that.
Along the way, I would like to continue to work hard as an infrastructure engineer so that I can be of use to all stakeholders.
I'm sorry that this is already a long post, but at the end, I would like to say something as an introvert. Sorry. . .
The reason for this is that I'm starting to feel that the profession of "infrastructure engineer" may be one of the environments in which "introverted" people can fit in.
Currently, I am in a phase where I can naturally accept and nurture my personality as an introvert. However, until I was 29 years old, I didn't feel that way at all, and I spent many days denying myself.
I was able to accept myself because I was blessed with the right opportunities, but for me it was a very difficult journey. . .
That's why I personally would be very happy if I could be of help to introverted people through the process of working as an infrastructure engineer.
I think it's going to be a long road ahead, but I hope you will watch over me with a long-term perspective, as you would moss. Thank you very much for your support.