[Osaka/Yokohama/Tokushima] Looking for infrastructure/server side engineers!

[Deployed by over 500 companies] AWS construction, operation, maintenance, and monitoring services

[Successor to CentOS] AlmaLinux OS server construction/migration service

[For WordPress only] Cloud server “Web Speed”

[Cheap] Website security automatic diagnosis “Quick Scanner”

[Reservation system development] EDISONE customization development service

[Registration of 100 URLs is 0 yen] Website monitoring service “Appmill”

[Compatible with over 200 countries] Global eSIM “Beyond SIM”

[If you are traveling, business trip, or stationed in China] Chinese SIM service “Choco SIM”

[Global exclusive service] Beyond's MSP in North America and China

[YouTube] Beyond official channel “Biyomaru Channel”

Aggregate data for a specific period from csv using Python [pandas]

2022.10.04

Osaka office

table of contents

1 environment
2 big picture
3 Purpose and preparation
4 Partial explanation
5 Output result

Hello.
My dreams are strong, but my insides are shaky. I'm Kawa from the System Solutions Department.

It's a pleasant season.
So this time,
I would like to mainly use the basic Python data analysis library "pandas" to aggregate a large amount of data from csv files.

environment

OS used: Microsoft Windows 10 Pro
Python version: 3.10.4

big picture

import pandas as pd from datetime import datetime as dt import collections import itertools df = pd.read_csv('<filename>.csv', usecols=['datetime', 'person_in_charge'], encoding='cp932' ) df_date = df .set_index('datetime') #For error checking def check(x): #print(x) pd.to_datetime(x) df_date.index.map(check) df_date.index=pd.to_datetime(df_date.index,format= '%Y%m%d %H:%M') df_date.sort_index(inplace=True) df_date['count'] = 1 df_multi = df_date.set_index([df_date.index.year, df_date.index.month, df_date .index.weekday, df_date.index.hour, df_date.index]) df_multi.index.names = ['year', 'month', 'weekday', 'hour', 'date'] df_date['person_in_charge'] = df_date['person_in_charge'].str.split(',') all_tag_list = list(itertools.chain.from_iterable(df_date['person_in_charge'])) c = collections.Counter(itertools.chain.from_iterable(df_date['person_in_charge' ])) tags = pd.Series(c) df_tag_list = [] top_tag_list = tags.sort_values(ascending=False).index[:11].tolist() for t in top_tag_list: df_tag = df_date[df_date['person_in_charge'] .apply(lambda x: t in x)] df_tag_list.append(df_tag[['count']].resample('1M').sum()) df_tags = pd.concat(df_tag_list, axis=1) df_tags.columns = top_tag_list df_tags.to_csv('result.csv', encoding='cp932') print("done")

Purpose and preparation

the csv file that serves as raw data
contains a large amount of information including the date the inquiry was received (in yyyy/mm/dd format) and the name of the person in charge who received the inquiry.
As a preprocessing, the date column name is set to "datetime" and the person in charge name is set to "person_in_charge".
Since it uses pandas as a library, it needs to
be installed from pip (For other environments such as Anaconda, please refer to the link)

Partial explanation

▼ Load the file with pandas. The csv file is placed in the same folder as the code file.

df = pd.read_csv('<filename>.csv', usecols=['datetime', 'person_in_charge'], encoding='cp932' )

▼ Assign the “datetime” column to the index using df.set_index.
*The latter part is for checking if there are any blank cells, so you can omit it.

df_date = df.set_index('datetime') def check(x): pd.to_datetime(x) df_date.index.map(check)

▼ Convert data type to datetime format & sort

df_date.index=pd.to_datetime(df_date.index,format='%Y%m%d %H:%M') df_date.sort_index(inplace=True) #Add count column for resampling df_date['count' ] = 1

※What is resampling (external link)

▼ Grouping or counting by day or time

df_multi = df_date.set_index([df_date.index.year, df_date.index.month, df_date.index.weekday, df_date.index.hour, df_date.index]) df_multi.index.names = ['year', 'month' , 'weekday', 'hour', 'date'] df_date['person_in_charge'] = df_date['person_in_charge'].str.split(',') all_tag_list = list(itertools.chain.from_iterable(df_date['person_in_charge' ])) Pass #iiterate to collection. Count the number of occurrences of each tag c = collections.Counter(itertools.chain.from_iterable(df_date['person_in_charge'])) #Convert to series tags = pd.Series(c)

※What is pandas series? (external link)

▼ Store tags in DataFrame

df_tag_list = [] top_tag_list = tags.sort_values(ascending=False).index[:11].tolist()

▼ This is the key: If you change the "1M" part of "df_tag_list.append(df_tag[['count']].resample('1M').sum())"
to "2W", for example, it will take 2 weeks If you select "'7D'", you can collect data every 7 days.

for t in top_tag_list: df_tag = df_date[df_date['person_in_charge'].apply(lambda x: t in x)] df_tag_list.append(df_tag[['count']].resample('1M').sum()) df_tags = pd.concat(df_tag_list, axis=1) df_tags.columns = top_tag_list

▼ Output the aggregated data to a new file

df_tags.to_csv('result.csv', encoding='cp932') #Output "done" when finished print("done")

Output result

the monthly data
compiled If you aggregate it by week or day if necessary and decorate it well, you can use it as data at meetings!

See you soon!

If you found this article helpful , please give it a like!

[2025.6.30 Amazon Linux 2 support ended] Amazon Linux server migration solution

The person who wrote this article

About the author

Kawa Ken

A curious Poke○n who belongs to the System Solution Department.

[Object storage] Comparison of Wasabi and Amazon S3 services [S3 compatible] Basic authentication using Nginx