Machine Learning: Linear Regression | MrHook's Time Machine

Question:
A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) across various products in different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.
Now they want to build a model to predict a customer's purchase amount for various products, which will help them create personalized offers for customers across different products.

Explanation:

Variable	Definition
User_ID	User ID
Product_ID	Product ID
Gender	Sex of User
Age	Age in bins
Occupation	Occupation (Masked)
City_Category	Category of the City (A,B,C)
Stay_In_Current_City_Years	Number of years of stay in the current city
Marital_Status	Marital Status
Product_Category_1	Product Category (Masked)
Product_Category_2	Product may also belong to another category (Masked)
Product_Category_3	Product may also belong to another category (Masked)
Purchase	Purchase Amount (Target Variable)
Your model performance will be evaluated on the basis of your prediction of the purchase amount for the test data (test.csv), which contains similar data-points as train except for their purchase amount.
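
The statement above does not name the scoring metric. As an illustration only, the sketch below assumes root mean squared error (RMSE), a common choice for regression tasks. The function name `rmse` and the sample numbers are made up for this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up purchase amounts, for illustration only
actual = [8000, 12000, 5000]
predicted = [9000, 11000, 5000]
score = rmse(actual, predicted)
print(score)
```

A lower value means the predictions are closer to the true purchase amounts, and the unit is the same as the Purchase column.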

Test_file
Train_file

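Approach: first read the training data with pandas, then encode and normalize it, fit a model with sklearn's linear regression, then load the test data and normalize it as well, and finally predict Purchase and write the results to a file.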
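
The encoding step in this approach can be sketched on a tiny hand-made frame. `Series.map` with a dict returns NaN for any value that is missing from the dict, so it may be worth checking for unmapped values before scaling. The sample rows below are invented, and the mapping is an assumption based on the column description above.

```python
import pandas as pd

# Invented sample rows; the real data comes from train.csv
sample = pd.DataFrame({"Stay_In_Current_City_Years": ["0", "2", "4+", "7"]})

# Ordinal encoding; "7" is deliberately absent to show the failure mode
stay_map = {"0": 0, "1": 1, "2": 2, "3": 3, "4+": 4}
encoded = sample["Stay_In_Current_City_Years"].map(stay_map)

# Values missing from the dict become NaN after map()
unmapped = sample.loc[encoded.isna(), "Stay_In_Current_City_Years"].tolist()
print(encoded.tolist())
print("unmapped values:", unmapped)
```

If the list of unmapped values is not empty, the mapping dict is incomplete for that column and the NaN values would otherwise flow into the scaler and the model.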

Answer:

import pandas as pd
from sklearn.linear_model import LinearRegression
import sklearn
import sklearn.preprocessing

df = pd.read_csv("train.csv")

# Encode the categorical columns as integers
gender_number = {'F': 0, 'M': 1}
age_number = {'0-17': 0, '18-25': 1, '26-35': 2, '36-45': 3, '46-50': 4, '51-55': 5, '55+': 6}
city_category_number = {'A': 0, 'B': 1, 'C': 2}
# '4+' gets its own code, 4, so it stays distinct from '1'
stay_in_current_city_years_number = {'0': 0, '1': 1, '2': 2, '3': 3, '4+': 4}
df['Gender'] = df['Gender'].map(gender_number)
df['Age'] = df['Age'].map(age_number)
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].map(stay_in_current_city_years_number)
df['City_Category'] = df['City_Category'].map(city_category_number)

x = df[['Gender','Age','City_Category','Occupation','Stay_In_Current_City_Years','Marital_Status','Product_Category_1']]
scaler = sklearn.preprocessing.MinMaxScaler()  # min-max normalization to [0, 1]
x_scaler = scaler.fit_transform(x)
y = df['Purchase']


model = LinearRegression()
model.fit(x_scaler,y)
print('R^2 on training data: \n', model.score(x_scaler, y))

print('Coefficient: \n',model.coef_)
print('Intercept: \n',model.intercept_)

df_test = pd.read_csv('test.csv')
df_test['Gender'] = df_test['Gender'].map(gender_number)
df_test['Age'] = df_test['Age'].map(age_number)
df_test['Stay_In_Current_City_Years'] = df_test['Stay_In_Current_City_Years'].map(stay_in_current_city_years_number)
df_test['City_Category'] = df_test['City_Category'].map(city_category_number)
x_test = df_test[['Gender','Age','City_Category','Occupation','Stay_In_Current_City_Years','Marital_Status','Product_Category_1']]
x_test_scaler = scaler.transform(x_test)  # reuse the min/max learned from the training data
y_predicted = model.predict(x_test_scaler)

df_result = pd.DataFrame({'User_ID':df_test['User_ID'],'Product_ID':df_test['Product_ID'],'Purchase':y_predicted})
print(df_result)
df_result.to_csv('result.csv', index=False)  # skip the DataFrame index column
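
A note on the scaling step: a `MinMaxScaler` learns each column's minimum and maximum when it is fit. Calling `fit_transform` again on the test set relearns those values, so the same raw value can land at a different scaled position than it did during training. The sketch below uses made-up one-column arrays to show the difference.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: one feature, three training rows, two test rows
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[5.0], [10.0]])

# Fit on the training data only, then apply the same scaling to the test data
scaler = MinMaxScaler().fit(train)
consistent = scaler.transform(test)          # uses train min=0, max=20

# Refitting on the test data uses test min=5, max=10 instead
refit = MinMaxScaler().fit_transform(test)

print(consistent.ravel())
print(refit.ravel())
```

With the scaler fit on the training data, the raw value 10 maps to 0.5 in both train and test. After refitting on the test data it maps to 1.0, which no longer matches what the model saw during training.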

Link: Black Friday. Like I already said, no amount of theory can beat practice. Here is a regression problem that you can try your hand at for a deeper understanding.