The infrastructure knowledge framework that a new infrastructure engineer found necessary after 3 months

【Table of contents】
① The role of infrastructure engineers
② System diagram of monitored objects (7 items)
③ Criteria for determining the urgency of monitored objects (4 levels)
Nice to meet you.
My name is Mizusawa Yoshihiro, and this is my first post on the Beyond Blog.
I joined Beyond in August 2020.
I am currently based at the Shikoku office in Miyoshi City, Tokushima Prefecture.
No, I'm very sorry...
I work at the "Persian Cat Office," in the same place as Inoue-san, who is a very active Persian cat. I'm always grateful for all your help, including with Nasu
Although it is a brief summary, I have listed my career history in the "Profile" section at the end, so please take a look if you are interested
Recently, for some reason, I have started to seriously think, "I want to become moss," "I want to become moss." Just to be clear, "It's not seaweed," "It's moss," "Moss!"
This is important. It won't be on the test, but it's extremely important
Seriously, I have been pondering deeply lately, like moss, and I have realized that I arrived at this way of thinking through my "great encounters" with Yamaguchi-san and with Susan Cain's books, as an "introvert" who has lived for years in an increasingly uncertain world. But I'll talk about that another time
So, this is the main topic of this blog
The theme of my first post is "Understanding systematic knowledge of IT infrastructure." Each topic is based on what I learned from the book "Introduction to IT Infrastructure Monitoring [Practice] for Software Engineers (by Yuichiro Saito)," and focuses on the points I felt were particularly necessary during my three months of work as a "newbie infrastructure engineer" who changed careers with no prior experience
Therefore, please forgive me for focusing more on the "operation and maintenance" side than the design and construction side at this time. In the future, I would like to write blog articles about the "design and construction" side as well
The reason I chose this theme is that while I have many opportunities to come into contact with the "inside" of infrastructure work on a daily basis, I have often felt that not being able to fully grasp the overall picture of the "outside" prevents me from understanding the events occurring on the inside
To use a cooking analogy: through daily home cooking I have gradually become able to make specific dishes such as stir-fried vegetables and hamburger steaks, yet I still struggled because I had not grasped the overall picture of cooking, that is, its outer framework of "procuring ingredients, cutting, heating, seasoning, and plating."
What changed this was my encounter with "Logical Cooking." Specifically, I learned from Chef Mizushima Hiroshi's book that "low-temperature, slow-heating = low to low-medium heat that rises 10°C per minute (vegetables: max 130°C, meat and fish: max 180°C)," "salting = 0.8% of the ingredient's weight," and "cutting with a knife at a 30-degree angle." This helped me to understand highly reproducible elements, which significantly reduced the burden of cooking. And from next month, I'm thinking of trying these methods not just at home, but using kitchen appliances like the Hot Cook and Healsio
I've gone a little off topic, but I have begun to feel that it would be a huge waste not to go through the same process as an infrastructure engineer now, so I decided to step back and look at the "big picture" of infrastructure knowledge
I wrote this article in the hope that it will be of particular help to the following people:
① Those aiming to become infrastructure engineers
② New infrastructure engineers who joined the company a few months ago
③ Myself one year from now
Well, let's get started
The role of infrastructure engineers
① Infrastructure design
→ Design carefully, taking into account the elements needed to deliver the performance that our customers expect
and to build an infrastructure environment that is less likely to cause failures:
"security, redundancy, availability, maintainability, reliability"
② Infrastructure construction
→Build an infrastructure environment that is less prone to failures based on design documents
③ Understanding failure alerts
→ It is extremely important to detect and understand failure alerts as quickly as possible
Our company uses monitoring tools such as "Zabbix" and "Datadog",
and we have set up a system that provides an initial response within five minutes
This is something I have felt keenly in my daily work since new employee training
Specifically, a delay of even a few seconds can have a significant impact on the customer's service
Imagine accessing a website you want to see and having to wait a few seconds
I think you can see how serious the impact is
The same applies when investigating the cause: a delay of just a few seconds can mean that
locating the problem in the "access log" or other sources takes several times longer than it otherwise would
④ Recovery response
→ Our company is putting a system in place to "recover as quickly as possible when a failure occurs"
Specifically, we have established a system in which "infrastructure engineers" are on duty 24 hours a day, 365 days a year,
and we strive for continuous improvement every day
In addition, to ensure smooth communication with stakeholders, we use tools such as "Chatwork"
System diagram of monitored objects (7 items)
① Alive monitoring (login status to server)
→ Monitor whether the target host responds to ping at the ICMP level
② Port alive monitoring (operation status to the server)
→ Monitor whether communication is available to a specific port (80, 443, etc.)
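As a rough illustration of what a port alive check does, here is a minimal Python sketch. It is not how Zabbix or Datadog implement the check; the function name `port_is_open` and the default timeout are my own assumptions for this example.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (including timeouts and "connection refused") on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("www.example.com", 443)` would report whether the HTTPS port of that (hypothetical) host accepts connections.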
③ Process alive monitoring
→ Monitor whether a specific process is running and how many instances of it are running (Apache, MySQL, Zabbix-agent, etc.)
→ Check using the top or ps command
(1) MySQL
- The number of connections is limited to the value set in my.cnf (max_connections)
- Replication delay is monitored only on slave servers,
  because this item monitors the delay that occurs when data is transferred from the master server to the slave server
- For the InnoDB Buffer Pool, check with mysqltuner.pl in advance and,
  if necessary, consider tuning or increasing the physical memory of your instance
→Identify the process where the load is increasing
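The idea of process alive monitoring can be sketched in Python as follows. This is only an illustration under my own assumptions: the function names are hypothetical, and the process names you pass in (for example `httpd` or `mysqld`) depend on your distribution. It relies on the standard `ps -e -o comm=` invocation, which prints one command name per line without a header.

```python
import subprocess


def count_processes(ps_output: str, name: str) -> int:
    """Count lines of `ps -e -o comm=` output whose command name equals name."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def running_count(name: str) -> int:
    """Run ps and count the processes with the given command name."""
    # -e selects all processes; -o comm= prints only the command name, no header.
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="], capture_output=True, text=True, check=True
    )
    return count_processes(result.stdout, name)
```

A monitoring script could raise an alert when `running_count("mysqld")` returns 0, or when the count of a worker process exceeds an expected upper limit.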
④ Performance monitoring
→Monitor the "performance indicators" of the object being monitored
(1) CPU Load Average
⇒ The average number of processes waiting for execution on the CPU
It is sampled every 5 seconds and expressed as averages over 1, 5, and 15 minutes
Specifically, you can check this with the "top -c" command
(2) CPU idle
⇒ The percentage of time the CPU is idle
(3) CPU iowait
⇒ The percentage of time the CPU spends waiting for input/output processing to complete
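To get a feel for the Load Average, it helps to compare it with the number of CPU cores. The following Python sketch does that. The threshold of 1.0 per core is a common rule of thumb that I am assuming here, not a value from the book or from our monitoring settings, and the function names are hypothetical.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    return load_avg / cores


def is_overloaded(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """Return True if the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


def current_load_status() -> dict:
    """Read the 1, 5 and 15 minute load averages of this machine (Unix only)."""
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    return {
        "1min": one,
        "5min": five,
        "15min": fifteen,
        "cores": cores,
        "overloaded": is_overloaded(five, cores),
    }
```

For example, a load average of 4.0 on a 2-core server gives 2.0 per core, which suggests that processes are queuing for the CPU.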
⑤ Resource monitoring
→Monitor the "usage" of the monitored object
(1) Memory usage
⇒ When total memory runs short, the "OOM Killer" is activated in Linux and
tries to free up memory by killing processes that are using a lot of memory
As a result, depending on which process is forcibly terminated, a system failure may occur
Specifically, you can check this with the "free" command
(2) Swap usage rate
⇒ "Swap" is the process of temporarily transferring the contents of memory to the hard disk when main memory is insufficient
⇒ Thrashing (repeated swap out/in)
(a) "Swap out" means moving unused memory pages from main memory
to the swap area when main memory is insufficient, freeing up space in main memory
(b) "Swap in" means loading a memory page in the swap area back into main memory
Note that reading from the swap area is tens of thousands of times slower than reading
from main memory and can significantly degrade the performance of the entire server
(3) Disk usage rate
⇒If disk usage is increasing, delete less important past logs and binary files
Also, setting up log rotation will help prevent bloat
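The three usage rates above can be calculated roughly as in the following Python sketch. It assumes a Linux host where the text of /proc/meminfo contains the fields MemTotal, MemAvailable, SwapTotal and SwapFree. The function names are hypothetical, and real monitoring tools may define "usage" slightly differently.

```python
import shutil


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def usage_percent(total: int, available: int) -> float:
    """Percentage of total that is in use; 0.0 if total is zero (e.g. no swap)."""
    if total == 0:
        return 0.0
    return 100.0 * (total - available) / total


def memory_and_swap_usage(meminfo_text: str) -> tuple:
    """Return (memory usage %, swap usage %) from /proc/meminfo-style text."""
    info = parse_meminfo(meminfo_text)
    mem = usage_percent(info["MemTotal"], info["MemAvailable"])
    swap = usage_percent(info.get("SwapTotal", 0), info.get("SwapFree", 0))
    return mem, swap


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

On a real server you would pass the contents of /proc/meminfo, for example `memory_and_swap_usage(open("/proc/meminfo").read())`.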
⑥ External monitoring
→ Monitor whether specific web pages containing your content can be accessed
The top page is monitored via a load balancer
→ Monitor whether the communication path to the web server is interrupted based on the ping response time
→ If packet loss occurs, the communication bandwidth limit may have been reached, or
there may be a malfunction in the network equipment or an error in the network device driver
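The "can this web page be accessed" part of external monitoring can be sketched with Python's standard library as follows. This is a simplified illustration, not the check our monitoring tools actually run; the function name `page_is_reachable` and the timeout value are assumptions.

```python
import urllib.error
import urllib.request


def page_is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within timeout."""
    try:
        # urlopen follows redirects and raises HTTPError for 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections and timeouts.
        return False
```

Because the request goes through the public URL, it also passes through the load balancer, so a failure here can point to a problem anywhere along that path.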
⑦ Information acquisition
→ Obtain information about the monitored object
(1) Notify when the value of "System Hostname" is changed
(2) Notify when the result value of the "uname command" is changed
(3) Notify when the value of the "/etc/passwd file checksum" is changed
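The checksum notification in (3) works on a simple principle: store a checksum, calculate it again later, and notify if the two differ. Here is a minimal Python sketch of that principle. I use SHA-256 as an assumption for the example; the monitoring tool itself may use a different checksum algorithm, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path: str) -> str:
    """Return the SHA-256 hex digest of the file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def has_changed(path: str, previous_checksum: str) -> bool:
    """Return True if the file's current checksum differs from the stored one."""
    return file_checksum(path) != previous_checksum
```

In practice you would save the result of `file_checksum("/etc/passwd")` once and call `has_changed` on each monitoring interval. A change may simply mean that a user was added, which is why this item is treated as information only.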
Criteria for determining the urgency of monitored objects (4 levels)
① A rank
⇒This is something that needs to be addressed as a top priority due to the possibility of service being suspended
(a) Alert generated in alive monitoring
(b) Alert generated in port alive monitoring
(c) Alert generated in external monitoring (a specific customer website)
(d) Alert generated in process alive monitoring (Apache)
(e) Alert generated in process alive monitoring (MySQL)
② B rank
⇒If left unattended, there is a risk of service being suspended, so this is something that needs to be addressed as a priority
(a) CPU Load Average alert
(b) CPU idle alert
(c) CPU iowait alert
(d) Memory usage alert
(e) Alert generated in process alive monitoring (Zabbix-agent)
③ C rank
⇒This does not immediately affect services, but requires a response as quickly as possible
(a) Swap usage rate alert
(b) Disk usage alert
④ I rank
⇒ Information notification only. An investigation into the cause of the incident will be conducted
(a) Notification of a change to the "System Hostname"
(b) Notification of a change in the result of the "uname command"
(c) Notification of a change to the "/etc/passwd file checksum"
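The four ranks above can be written down as a simple lookup table, which makes it easy to sort incoming alerts by urgency. The following Python sketch is only my own way of organizing the list; the alert key names such as `port_alive` or `process:apache` are hypothetical labels, not identifiers from Zabbix or Datadog.

```python
from enum import Enum


class Rank(Enum):
    A = "A"  # top priority: service may be suspended
    B = "B"  # priority: service may be suspended if left unattended
    C = "C"  # no immediate impact, but respond as quickly as possible
    I = "I"  # information notification only


# Hypothetical alert keys mapped to the ranks listed in this article.
RANK_BY_ALERT = {
    "alive": Rank.A,
    "port_alive": Rank.A,
    "external": Rank.A,
    "process:apache": Rank.A,
    "process:mysql": Rank.A,
    "cpu_load_average": Rank.B,
    "cpu_idle": Rank.B,
    "cpu_iowait": Rank.B,
    "memory_usage": Rank.B,
    "process:zabbix-agent": Rank.B,
    "swap_usage": Rank.C,
    "disk_usage": Rank.C,
    "hostname_changed": Rank.I,
    "uname_changed": Rank.I,
    "passwd_checksum_changed": Rank.I,
}

_ORDER = [Rank.A, Rank.B, Rank.C, Rank.I]


def rank_of(alert_key: str) -> Rank:
    """Look up the urgency rank of an alert key."""
    return RANK_BY_ALERT[alert_key]


def most_urgent_first(alert_keys: list) -> list:
    """Sort alert keys from rank A down to rank I, keeping ties in input order."""
    return sorted(alert_keys, key=lambda key: _ORDER.index(rank_of(key)))
```

When several alerts arrive at once, `most_urgent_first` gives the order in which to look at them.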
Summary
What do you think?
Were they helpful in understanding the overall picture of infrastructure operations?
Since I first started working as an infrastructure engineer in August 2020, I have come to realize that this is a meaningful job that allows me to support our customers' important "information assets."
However, at the same time, I feel a sense of tension and crisis every day, and I keep hitting walls. To be honest, I don't know how far I've climbed in three months, but I think the only thing I can do is continue to struggle every day and increase the things I can do now, one thing at a time
Fortunately, I am blessed with people who are willing to listen to my questions after I have formulated and thought about my own hypotheses, and I am very grateful for that
In the process, I would like to continue to work hard as an infrastructure engineer to be of service to all stakeholders
I apologize for the length of this post, but let me finally talk about my experiences as an introvert. Sorry..
The reason is that I'm beginning to feel that the profession of "infrastructure engineer" might be one of the environments in which "introverted types" can fit in
I am currently in a phase where I can naturally accept and develop my introverted personality. However, until I was 29, I didn't feel that way at all, and I continued to deny myself
I was able to accept these things because I was blessed with the appropriate opportunities, but for me it was an extremely tough journey..
That's why I would personally be very happy if I could be of help to introverts through the process of working as an infrastructure engineer
I still have a long way to go, but I hope that you will take a long-term view of me, as you would with moss. Thank you very much for your continued support