Features Engineering techniques for Time-series Analysis in Python.
Before we start digging into our Data, it’s very important to have an idea about Time Series. Basically, A time series is a sequence of numerical data points in successive order, In most cases, Time series is a set of observations on the values that a variable takes at different times.
In this article, I’m going to focus on some basic techniques of Features Engineering that can enhance your predictions over time. It’s been a long time since collaborating in a time series project, so I decided to have a quick refresh of the techniques I used to improve my prediction and share it with you.
Feature engineering is the process of using the domain knowledge to create relevant features in order to make your Machine Learning algorithms more accurate. Feature engineering can impact noticeably the performance of your modeling journey, If it’s done correctly, it helps your model to perform very well.
Before involving this step, you need to follow the very basic pipeline of solving a problem in Data Science, which is :
The data is pretty self-explanatory, it’s a univariate Time Series, we some repetitive pattern, we have two columns Date and value, the date column is treated as an object, so we need to convert it into a DateTime type, this can be done easily using Pandas :
import pandas as pd
data = pd.read_csv("AEP_hourly.csv")
data["AEP_MW"].plot(figsize = (8,3))
data.head()
data["Datetime"] = pd.to_datetime(data['Datetime'],format='%Y-%m-%d %H:%M')
Now that our data is ready, let’s have a look about things that we can dig from those variables, as I said, the dataset contains two columns, a date associated with a value, so let’s discuss some of the techniques used to engineer this kind of problems.
Date related features are a special type of categorical variables, a date provides only a specific point on a timeline. A date can be engineered properly if we have some domain knowledge or a deep understanding of the problem
To answer the first question, if our data is related to some personal usage, for example, we know that the use of energy is lower at night compared to daytime; so we can conclude that there can some hidden pattern if we extract hours from the date, and this answers the second question too since we know that there’s a repetitive trend in our data.
Extracting weekends and weekdays could be useful to know if there are any changing patterns during the week, and months/years too could be a very helpful measure to see if there’s a change in the seasons.
Extracting these features is very simple with Python :
data['year']=data['Datetime'].dt.year
data['month']=data['Datetime'].dt.month
data['day']=data['Datetime'].dt.day
data['dayofweek']=data['Datetime'].dt.dayofweek
data['dayofweek_name']=data['Datetime'].dt.day_name()
data.head()
We can extract more than that, for example :
import pandas as pd
data = pd.read_csv("AEP_hourly.csv")
data["Datetime"] = pd.to_datetime(data['Datetime'],format='%Y-%m-%d %H:%M')
data['year']=data['Datetime'].dt.year
data['month']=data['Datetime'].dt.month
data['day']=data['Datetime'].dt.day
data['dayofweek']=data['Datetime'].dt.dayofweek
data['day_name']=data['Datetime'].dt.day_name()
data['quarter']=data['Datetime'].dt.quarter
data['is_month_end']=data['Datetime'].dt.is_month_end
data['is_month_start']=data['Datetime'].dt.is_month_start
data['is_year_start']=data['Datetime'].dt.is_month_start
data['is_year_end']=data['Datetime'].dt.is_month_start
In the same way, we can extract more features using the timestamp part of the date, for example, we can extract hours, minutes, and even seconds since the date is hourly so the only feature that we can extract here is hours.
data['Hour'] = data['Datetime'].dt.hour
We can generate a lot of features using the date column, here’s the documentation link if you want to explore more about Date/time-based features.
The lag features are basically the target variable but shifted with a period of time, it is used to know the behavior of our target value in the past, maybe a day before, a week or a month.
Using lag features can be sometimes a double-edged sword, since using the target variable is very tricky and it can lead sometimes to overfitting if not used properly.
data["Lag1"] = data['AEP_MW'].shift(3) # Lag with 3 hours
data["Lag5"] = data['AEP_MW'].shift(5) # Lag with 5 hours
data["Lag8"] = data['AEP_MW'].shift(8) # Lag with 8 hours
Rolling window features are calculating some statistics based on past values :
We can implement that using this line of code :
data['rollingMean'] = data['AEP_MW'].rolling(window=6).mean()
Similarly, we can generate a number of different time series features that can be useful to predict different forecast distances :
Another type of window that includes all previous data in the series.
This can be implemented easily in Python by using the expanding() function :
data['ExpandingMean'] = data['AEP_MW'].expanding(6).mean()
All those features can be useful when we have a good understanding of the problem, and this will lead us to make meaningful insights and avoid features explosion.
There’s some other useful Features Engineering techniques for non-Time Series data that I like to share with you.
There’s some other type of feature engineering that involves revealing interactions between features
A general tip is to look at each pair of features and ask yourself, “could I combine this information in any way that might be even more useful?”
Feature engineering turns your inputs into things the algorithm can understand.
Generating too many features can confuse your model and lower the performance, we should always choose the most contributing features and remove unused and redundant ones, and most importantly don’t ever use your target variable as an indicator (this includes a very close lag feature), this could give you very misleading results.
That’s all Folks ! this was my very first article by the way, I hope you enjoy it.
Don’t forget .. Believe the data, not opinions.
Data source: https://www.kaggle.com/robikscube/hourly-energy-consumption