Aggregate data for a specific period from csv using Python [pandas]
Hello.
My dreams are strong, but my insides are shaky. I'm Kawa from the System Solutions Department.
It's a pleasant season.
So this time,
I would like to mainly use the basic Python data analysis library "pandas" to aggregate a large amount of data from csv files.
environment
OS used: Microsoft Windows 10 Pro
Python version: 3.10.4
big picture
import pandas as pd from datetime import datetime as dt import collections import itertools df = pd.read_csv('<filename>.csv', usecols=['datetime', 'person_in_charge'], encoding='cp932' ) df_date = df .set_index('datetime') #For error checking def check(x): #print(x) pd.to_datetime(x) df_date.index.map(check) df_date.index=pd.to_datetime(df_date.index,format= '%Y%m%d %H:%M') df_date.sort_index(inplace=True) df_date['count'] = 1 df_multi = df_date.set_index([df_date.index.year, df_date.index.month, df_date .index.weekday, df_date.index.hour, df_date.index]) df_multi.index.names = ['year', 'month', 'weekday', 'hour', 'date'] df_date['person_in_charge'] = df_date['person_in_charge'].str.split(',') all_tag_list = list(itertools.chain.from_iterable(df_date['person_in_charge'])) c = collections.Counter(itertools.chain.from_iterable(df_date['person_in_charge' ])) tags = pd.Series(c) df_tag_list = [] top_tag_list = tags.sort_values(ascending=False).index[:11].tolist() for t in top_tag_list: df_tag = df_date[df_date['person_in_charge'] .apply(lambda x: t in x)] df_tag_list.append(df_tag[['count']].resample('1M').sum()) df_tags = pd.concat(df_tag_list, axis=1) df_tags.columns = top_tag_list df_tags.to_csv('result.csv', encoding='cp932') print("done")
Purpose and preparation
the csv file that serves as raw data
contains a large amount of information including the date the inquiry was received (in yyyy/mm/dd format) and the name of the person in charge who received the inquiry.
As a preprocessing, the date column name is set to "datetime" and the person in charge name is set to "person_in_charge".
Since it uses pandas as a library, it needs to
be installed from pip (For other environments such as Anaconda, please refer to the link)
Partial explanation
▼ Load the file with pandas. The csv file is placed in the same folder as the code file.
df = pd.read_csv('<filename>.csv', usecols=['datetime', 'person_in_charge'], encoding='cp932' )
▼ Assign the “datetime” column to the index using df.set_index.
*The latter part is for checking if there are any blank cells, so you can omit it.
df_date = df.set_index('datetime') def check(x): pd.to_datetime(x) df_date.index.map(check)
▼ Convert data type to datetime format & sort
df_date.index=pd.to_datetime(df_date.index,format='%Y%m%d %H:%M') df_date.sort_index(inplace=True) #Add count column for resampling df_date['count' ] = 1
※What is resampling (external link)
▼ Grouping or counting by day or time
df_multi = df_date.set_index([df_date.index.year, df_date.index.month, df_date.index.weekday, df_date.index.hour, df_date.index]) df_multi.index.names = ['year', 'month' , 'weekday', 'hour', 'date'] df_date['person_in_charge'] = df_date['person_in_charge'].str.split(',') all_tag_list = list(itertools.chain.from_iterable(df_date['person_in_charge' ])) Pass #iiterate to collection. Count the number of occurrences of each tag c = collections.Counter(itertools.chain.from_iterable(df_date['person_in_charge'])) #Convert to series tags = pd.Series(c)
※What is pandas series? (external link)
▼ Store tags in DataFrame
df_tag_list = [] top_tag_list = tags.sort_values(ascending=False).index[:11].tolist()
▼ This is the key: If you change the "1M" part of "df_tag_list.append(df_tag[['count']].resample('1M').sum())"
to "2W", for example, it will take 2 weeks If you select "'7D'", you can collect data every 7 days.
for t in top_tag_list: df_tag = df_date[df_date['person_in_charge'].apply(lambda x: t in x)] df_tag_list.append(df_tag[['count']].resample('1M').sum()) df_tags = pd.concat(df_tag_list, axis=1) df_tags.columns = top_tag_list
▼ Output the aggregated data to a new file
df_tags.to_csv('result.csv', encoding='cp932') #Output "done" when finished print("done")
Output result
the monthly data
compiled If you aggregate it by week or day if necessary and decorate it well, you can use it as data at meetings!
See you soon!