Structured Data Learning

1. Auto-reload Modules

```python
# Auto-reload modules in the Jupyter notebook (so that changes in *.py files
# don't require a manual reload)
%reload_ext autoreload
%autoreload 2

# Display the output of plotting commands inline in the notebook
%matplotlib inline
```

3. Check NVIDIA GPU Framework

```python
import torch

# An NVIDIA GPU with the CUDA programming framework is critical;
# the following command must return True
torch.cuda.is_available()

# Make sure cuDNN (NVIDIA's deep learning library for CUDA) is enabled,
# since it improves training performance (preferred)
torch.backends.cudnn.enabled
```

4. Set Parameters

```python
import os

# Example 1: binary image classification
# PATH is the path to the data
PATH = '/home/paperspace/fastai/courses/SelfCodes/Structured and Time series analysis/data/'
os.chdir(PATH)
%pwd
```

5. Observations

a. Observe Folder Structure of path

```python
# Example with cats and dogs:
# list the contents of PATH
os.listdir(PATH)

# list the contents of the 'train' directory
os.listdir(f'{PATH}train')
```

b. Observe Files

Store the data and CSV files under PATH and inspect them the same way.
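A minimal sketch of loading and inspecting one of the CSV files (the file name `train.csv` is an assumption; substitute your own):

```python
import pandas as pd

# Hypothetical file name; low_memory=False avoids mixed-dtype warnings
df = pd.read_csv(f'{PATH}train.csv', low_memory=False)
df.head()
```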

6. Feature Engineering

Categorical & Continuous Variables

There are two types of columns:

- Categorical: it has a number of "levels", e.g. StoreType, Assortment.
- Continuous: it has a number where differences or ratios of those numbers have some kind of meaning, e.g. CompetitionDistance.

```python
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday',
            'CompetitionMonthsOpen', 'Promo2Weeks', 'StoreType', 'Assortment',
            'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
            'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw',
            'StateHoliday_fw', 'StateHoliday_bw', 'SchoolHoliday_fw',
            'SchoolHoliday_bw']

contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC',
               'Min_TemperatureC', 'Max_Humidity', 'Mean_Humidity',
               'Min_Humidity', 'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h',
               'CloudCover', 'trend', 'trend_DE', 'AfterStateHoliday',
               'BeforeStateHoliday', 'Promo', 'SchoolHoliday']

n = len(joined); n
```

Info 1

Numbers like Year and Month could be treated as continuous, but we do not have to. If we decide to make Year a categorical variable, we are telling our neural net that for every different "level" of Year (2000, 2001, 2002), it can treat it totally differently; whereas if we say it is continuous, it has to come up with some kind of smooth function to fit them. So for things that actually are continuous but do not have many distinct levels (e.g. Year, DayOfWeek), it often works better to treat them as categorical.

Info 2

Choosing categorical vs. continuous is a modeling decision you get to make. In summary, if a variable is categorical in the data, it has to be categorical in the model. If it is continuous in the data, you get to pick whether to make it continuous or categorical in the model. Generally, floating point numbers are hard to make categorical because there are many levels (we call the number of levels the "cardinality"; e.g. the cardinality of the day-of-week variable is 7).
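Cardinality is easy to check directly in pandas; a quick illustrative check, assuming the joined data frame from above:

```python
# Number of distinct levels ("cardinality") of a column
joined['DayOfWeek'].nunique()  # 7 for a day-of-week variable
```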

Info 3

If you are using Year as a category, what happens when the model encounters a year it has never seen before? [31:47] We will get there, but the short answer is that it will be treated as an unknown category. Pandas has a special unknown category, and if it sees a level it has not seen before, the value gets treated as unknown.
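A small pandas illustration of how unseen levels behave (the year values here are made up):

```python
import pandas as pd

# Fix the set of known categories, as if learned from training data
known_years = pd.CategoricalDtype(categories=['2013', '2014'])
test = pd.Series(['2013', '2015']).astype(known_years)

test            # unseen '2015' becomes NaN: an "unknown" category
test.cat.codes  # codes are [0, -1]; -1 marks the unknown level
```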

Code

Loop through cat_vars and turn the applicable data frame columns into categorical columns. Loop through contin_vars and cast them to float32 (32-bit floating point), because that is what PyTorch expects.

```python
dep = 'Sales'
joined = joined[cat_vars + contin_vars + [dep, 'Date']].copy()

# Categorical columns become ordered pandas categoricals
for v in cat_vars:
    joined[v] = joined[v].astype('category').cat.as_ordered()

# Continuous columns become 32-bit floats, as PyTorch expects
for v in contin_vars:
    joined[v] = joined[v].astype('float32')
```

7. Model Development

Start with a small sample

```python
# get_cv_idxs(n, val_pct) returns a random subset of row indices;
# here val_pct is chosen so that roughly 150,000 rows are sampled
idxs = get_cv_idxs(n, val_pct=150000/n)

# DataFrame.iloc: integer-location based indexing for selection by position.
# set_index sets the DataFrame index (row labels) using an existing column.
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size

# Observe the data
joined_samp.head(2)
```

Process the data frame

This step pulls out the dependent variable, puts it into a separate variable, and deletes it from the original data frame. In other words, df does not have the Sales column, and y only contains the Sales column.

do_scale: neural nets really like the input data to all be somewhere around zero with a standard deviation of somewhere around 1. So we take our data, subtract the mean, and divide by the standard deviation to make that happen. It returns a special object which keeps track of the mean and standard deviation it used for that normalization, so you can do the same to the test set later (mapper).

It also handles missing values: for a categorical variable, missing becomes ID 0 and the other categories become 1, 2, 3, and so on. For a continuous variable, it replaces the missing value with the median and creates a new boolean column that says whether it was missing or not.
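A sketch of the corresponding call, assuming the fastai 0.7 proc_df function this section describes:

```python
# proc_df splits off the dependent variable and normalizes/fills the rest;
# 'mapper' stores the per-column mean and std for reuse on the test set
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)  # the model is trained on log(Sales)
```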

Creation of Validation set

```python
import datetime

# Validation set: rows whose Date falls between 1 Aug 2014 and 17 Sep 2014
val_idx = np.flatnonzero((df.index <= datetime.datetime(2014, 9, 17)) &
                         (df.index >= datetime.datetime(2014, 8, 1)))
```

Error Measurement

```python
def inv_y(a):
    # Undo the log transform applied to Sales
    return np.exp(a)

def exp_rmspe(y_pred, targ):
    # Root mean squared percentage error, computed on the original scale
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred)) / targ
    return math.sqrt((pct_var**2).mean())

max_log_y = np.max(yl)
y_range = (0, max_log_y * 1.2)
```

Create Model Data object

As per usual, we will start by creating a model data object which has a validation set, a training set, and an optional test set built into it. From that, we will get a learner, then optionally call lr_find, then call learn.fit and so forth.
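For columnar data the model data object is created like this (the exact call, recapped in the summary below):

```python
# Build a model data object from the data frame; cat_flds marks the
# categorical columns, bs is the batch size
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32),
                                       cat_flds=cat_vars, bs=128, test_df=df_test)
```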

Create embedding matrices

Embeddings are parameters that we are learning that happen to end up giving us a good loss. We will discover later that these particular parameters are often human-interpretable and quite interesting, but that is a side effect.
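The embedding sizes come from a rule of thumb used in the course; a sketch, assuming the joined_samp and cat_vars defined above:

```python
# Cardinality of each categorical variable (+1 for the unknown category)
cat_sz = [(c, len(joined_samp[c].cat.categories) + 1) for c in cat_vars]
# Rule of thumb: embedding size = min(50, (cardinality + 1) // 2)
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
```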

Create Learner for Model Data

```python
# emb_szs: embedding sizes; len(df.columns) - len(cat_vars): number of
# continuous variables; 0.04: embedding dropout; 1: output size;
# [1000, 500]: hidden layer sizes; [0.001, 0.01]: hidden layer dropouts
m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars),
                   0.04, 1, [1000, 500], [0.001, 0.01], y_range=y_range)
lr = 1e-3
```

Fit Learner on Validation Set
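The fitting code itself is missing from these notes; a typical call from the course, using the lr defined above, would be:

```python
# Train for 3 epochs, reporting RMSPE on the validation set
m.fit(lr, 3, metrics=[exp_rmspe])
```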

Summary

Step 1. List the categorical variable names and the continuous variable names, and put them in a Pandas data frame.

Step 2. Create a list of which row indexes you want in your validation set.

Step 3. Call this exact line of code:

```python
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32),
                                       cat_flds=cat_vars, bs=128, test_df=df_test)
```

Step 4. Create a list of how big you want each embedding matrix to be.

Step 5. Call get_learner; you can use these exact parameters to start with:

```python
m = md.get_learner(emb_szs, len(df.columns) - len(cat_vars),
                   0.04, 1, [1000, 500], [0.001, 0.01], y_range=y_range)
```

Step 6. Call m.fit.

Text Classification

We shall use a pre-trained network which at least knows how to read English. We will train a model that predicts the next word of a sentence (i.e. a language model) and, just like in computer vision, stick some new layers on the end and ask it to predict whether something is positive or negative.

Info

Fine-tuning a pre-trained network is really powerful. If we can get it to learn some related task first, then we can use all that information to help it on the second task. After reading thousands of words, knowing nothing about how English is structured or the concept of a word or punctuation, all you get is a 1 or a 0 (positive or negative). Trying to learn the entire structure of English, and then how it expresses positive and negative sentiments, from a single number is just too much to expect.

1. Import Libraries

```python
# Auto-reload modules in the Jupyter notebook (so that changes in *.py files
# don't require a manual reload)
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

# torchtext: PyTorch's NLP library
import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy
```

Data Viewing

Tokenization

Before we can do anything with text, we have to turn it into a list of tokens. A token is basically like a word. Eventually we will turn the tokens into a list of numbers, but the first step is to turn the text into a list of words; this is called "tokenization" in NLP. A good tokenizer will do a good job of recognizing the pieces in your sentence: each separate piece of punctuation will be separated, and each part of a multi-part word will be separated as appropriate. spaCy does a lot of NLP stuff, and it has the best tokenizer, so the fast.ai library is designed to work well with the spaCy tokenizer as well as with torchtext.
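A small illustrative example (the model name 'en_core_web_sm' is an assumption; older spaCy versions loaded 'en' instead):

```python
import spacy

nlp = spacy.load('en_core_web_sm')  # assumes this English model is installed
doc = nlp("I don't like Mondays.")
print([t.text for t in doc])
# Expected: ['I', 'do', "n't", 'like', 'Mondays', '.']
```

Note how the contraction "don't" is split into "do" and "n't", and the full stop becomes its own token.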

Language Model Development

Training Language Model

Testing Language Model

Sentiment Classification

We pre-trained a language model, and now we want to fine-tune it to do sentiment classification.

2. Import all main external libraries

```python
# Modules to import for structured data analysis
from fastai.structured import *
from fastai.column_data import *

# These options determine the way floating point numbers, arrays and
# other NumPy objects are displayed
np.set_printoptions(threshold=50, edgeitems=20)
```

Recommender System

Import Relevant Libraries

```python
# Import relevant libraries
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *
from fastai.column_data import *
```

Read & Observe Input Data

Model Development

Collaborative Filtering From Scratch

Dot product example

```python
a = T([[1., 2], [3, 4]])
b = T([[2., 2], [10, 10]])
a, b
# ( 1   2
#   3   4   [torch.FloatTensor of size 2x2],
#   2   2
#  10  10   [torch.FloatTensor of size 2x2])

# Element-wise multiply, then sum along each row: the dot product of
# each row of a with the corresponding row of b
(a * b).sum(1)
# 6
# 70
```
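This row-wise dot product is the core operation of a collaborative filtering model: the predicted rating is the dot product of a user's embedding vector and an item's embedding vector. A minimal illustrative sketch (the shapes and names here are made up):

```python
import torch

n_factors = 3
users = torch.randn(5, n_factors)   # 5 user embedding vectors
movies = torch.randn(5, n_factors)  # 5 movie embedding vectors, row-matched
preds = (users * movies).sum(1)     # one predicted rating per (user, movie) pair
```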