Machine Learning: Multi-class Sample Classification (梧桐杯 Big Data Competition, Code Notes)


A few days ago my teacher inexplicably dragged me into this competition, so here is a record of the code.

The A-board and B-board rounds together took me less than a day. The best accuracy on the leaderboard was 83%, and I got 82%; with a few more submissions it probably could have gone higher, but I entered on short notice anyway, so it doesn't matter much.

It was my first machine learning competition (even if a somewhat easy one), so here is a short write-up.

The task: we are given a pile of user data, 40 features per user, some categorical and some numeric, with quite a few missing values. There are 60,000 records in total, and the goal is to classify each user into one of three classes.

The main problem is that the data preprocessing is especially tedious, because the type of each feature has to be judged by hand. Other than that, nothing is particularly difficult.

Data Preprocessing

Unzip the data:

! pwd
! unzip -o "/home/workspace/input/人工智能赛道(河南)/人工智能赛道B榜数据.zip" -d "/home/workspace/output/data"
! unzip -o "/home/workspace/input/人工智能赛道(河南)/人工智能赛道A榜数据.zip" -d "/home/workspace/output/data"
! cp "/home/workspace/input/人工智能赛道(河南)/submitB.csv" "/home/workspace/output/submitB.csv"

Read the data and split off the feature columns.

import pandas as pd

data = pd.read_csv("/home/workspace/output/data/toUser/train.csv")

y = data['sample_flag']
X = data.drop(['sample_flag', 'user_id'], axis=1)

print(X.columns)
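
Since the feature types have to be judged by hand, a quick per-column summary helps; this is just a sketch of how I would eyeball it, not part of the original pipeline.

# Illustrative only: dtype, distinct-value count, and missing count per column.
# Columns with few distinct values are usually categorical; the rest numeric.
summary = pd.DataFrame({
    'dtype': X.dtypes,
    'nunique': X.nunique(),
    'missing': X.isna().sum(),
})
print(summary.sort_values('nunique'))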

Preprocessing helper functions (map_key_value caches the mean/mode the first time a column is filled, so statistics computed on the training set can be reused later on the test set):

map_key_value = {}

'''
Mean-fill function: replace NaNs with the column mean
(the first computed mean is cached in map_key_value and reused on later calls)
'''
def mean_fill(Xcol):
    global map_key_value
    if Xcol.name in map_key_value:
        mean = map_key_value[Xcol.name]
    else:
        mean = Xcol.mean()
        map_key_value[Xcol.name] = mean

    def _mean_fill(value):
        if pd.isna(value):
            value = mean
        return value
    Xcol = Xcol.map(_mean_fill)
    return Xcol

'''
Mode-fill function: replace NaNs with the column mode (cached the same way)
'''
def mode_fill(Xcol):
    global map_key_value
    if Xcol.name in map_key_value:
        mode = map_key_value[Xcol.name]
    else:
        mode = Xcol.mode()[0] if not Xcol.mode().empty else None
        map_key_value[Xcol.name] = mode

    def _mode_fill(value):
        if pd.isna(value):
            value = mode
        return value
    Xcol = Xcol.map(_mode_fill)
    return Xcol

'''
Zero-fill function: replace NaNs with 0
'''
def zero_fill(Xcol):
    def _zero_fill(value):
        if pd.isna(value):
            value = 0
        return value
    Xcol = Xcol.map(_zero_fill)
    return Xcol
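
As a tiny sanity check of how these helpers behave (illustrative only; the "demo" column is made up), the cached mean shows up in map_key_value after the first call:

s = pd.Series([1.0, None, 3.0], name="demo")
print(mean_fill(s))     # the NaN is replaced by the column mean, 2.0
print(map_key_value)    # {'demo': 2.0} -- reused on any later call for this column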

Split the features into numeric and categorical groups:

'''
Split into numeric features and categorical features
'''

numeric_features = ['age', 'join_date', 'change_equip_period_avg', 'term_price', 'ztc_gprs_res', 'ztc_price', 'avg3_tc_ll', 'avg3_tw_ll', 'avg3_dou',
                    'avg3_mou', 'avg3_llct_cnt', 'avg3_yyct_cnt', 'avg3_ll_bhd', 'avg3_sl_ll', 'll_bhd', 'sl_ll2', 'avg3_tc_price', 'avg3_ctll_fee',
                    'avg3_ctyy_fee', 'avg3_video_app1_cnt', 'avg3_video_app2_cnt', 'avg3_video_app_ll', 'avg3_music_app1_cnt', 'avg3_music_app2_cnt',
                    'avg3_music_app_ll', 'avg3_game_app1_cnt', 'avg3_game_app2_cnt', 'avg3_game_app_ll']

# note: 'age' is listed in both groups, so it will pass through both the numeric
# and the categorical branch of the ColumnTransformer built later
categorical_features = ['area_code', 'age', 'gender_id', 'zfk_type', 'user_type', 'group_type', 'jt_5gzd_flag',
                        'jt_5gwl_flag', 'term_brand', 'avg3_llb_flag', 'sl_flag', 'sl_type']

print(len(numeric_features) + len(categorical_features))

Special-case Handling

Convert the date field into a numeric millisecond timestamp:

'''
Convert the date to a millisecond timestamp
'''

X['join_date'] = pd.to_datetime(data['join_date'])
X['join_date'] = (X['join_date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1ms')

print(X['join_date'].head())

Handle unknown ages (this dataset encodes an unknown age as 999):

'''
Handle unknown ages
'''
X['age'] = data['age']

age_mode = X['age'].mode()[0]
print(f"Mode of Age is {age_mode}")

def prepare_age(value):
    global age_mode
    if value == 999 or pd.isna(value):
        value = age_mode
    return value

X['age'] = X['age'].map(prepare_age)
print(X['age'])

The area code is not really numeric, so convert it into a categorical encoding:

'''
Map area codes to categorical integer codes
'''

X['area_code'] = data['area_code'].astype(str)

flag = {}
cnt = 0
def map_to_type(value):
    global flag
    global cnt
    value = str(value)
    if value not in flag:
        flag[value] = cnt
        cnt = cnt + 1
    return flag[value]

X['area_code'] = X['area_code'].map(map_to_type)
print(X['area_code'])

Filling Missing Values

Columns where a missing value should become 0:

'''
Handle jt_5gwl_flag and the other flag columns: missing means 0
'''

need_zero = ['jt_5gwl_flag', 'sl_flag', 'sl_type']
for col in need_zero:
    X[col] = zero_fill(X[col])

For the remaining columns, fill categorical features with the mode and numeric features with the mean:

'''
Numeric features: mean fill; categorical features: mode fill
'''
for col in numeric_features:
    X[col] = mean_fill(X[col])

for col in categorical_features:
    X[col] = mode_fill(X[col])

Finally, export it for a quick look:

X.to_csv('/home/workspace/output/data/X.csv', index=False)

Model Building

Label conversion: in the given data the classes are 1 to 3, but many models expect labels starting from 0, so convert them here and convert back afterwards.

def transform_labels(y):
    """
    Map labels from {1, 2, 3} to {0, 1, 2}.
    """
    return y - 1

def inverse_transform_labels(y):
    """
    Map labels from {0, 1, 2} back to {1, 2, 3}.
    """
    return y + 1

y = transform_labels(y)

Build and train the model. A random forest is used here; I also tried other machine learning models, but random forest performed best on this problem.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score


numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
X[categorical_features] = X[categorical_features].astype(str)
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Model accuracy: {accuracy_score(y_test, y_pred)}")

Accuracy on this held-out test split is 89%.
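
Accuracy alone hides how the smaller classes are doing. A per-class report (not part of my original run, just a sketch reusing the split above) would look like this:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall/F1 and the confusion matrix on the held-out split
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))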

Applying the Model to the Test Data

Preprocess the test data:

data = pd.read_csv("/home/workspace/output/data/testB.csv")
X = data.drop(['user_id'], axis=1)

X['join_date'] = pd.to_datetime(X['join_date'])
X['join_date'] = (X['join_date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta('1ms')

print(X['join_date'].head())


'''
Handle unknown ages
'''
X['age'] = data['age']
X['age'] = X['age'].map(prepare_age)
print(X['age'])


'''
Map area codes to categorical integer codes
'''

X['area_code'] = X['area_code'].astype(str)
def map_to_type(value):
    # reuse the `flag` mapping built on the training set; this assumes every
    # area code in the test set already appeared in the training data
    value = str(value)
    return flag[value]

X['area_code'] = X['area_code'].map(map_to_type)
print(X['area_code'])

'''
Handle jt_5gwl_flag and the other flag columns: missing means 0
'''

need_zero = ['jt_5gwl_flag', 'sl_flag', 'sl_type']
for col in need_zero:
    X[col] = zero_fill(X[col])   # zero-fill to match the training-set preprocessing

'''
Numeric features: mean fill; categorical features: mode fill
(the means/modes cached from the training set are reused here)
'''
for col in numeric_features:
    X[col] = mean_fill(X[col])

for col in categorical_features:
    X[col] = mode_fill(X[col])

X[categorical_features] = X[categorical_features].astype(str)

Predict with the model:

! cp "/home/workspace/input/人工智能赛道(河南)/submitB.csv" "/home/workspace/output/submitB.csv"


y_pred = model.predict(X)

y_pred = inverse_transform_labels(y_pred)
print(y_pred)

df = pd.read_csv('/home/workspace/output/submitB.csv')
df['predtype'] = y_pred

df.to_csv('/home/workspace/output/submitB.csv', index=False)

Check how many predictions fall into each class, to see whether anything looks off:

import pandas as pd
import numpy as np

y_pred_series = pd.Series(y_pred)
class_counts = y_pred_series.value_counts()

print(class_counts)

My final result:

1    53976
2     5790
3      234
dtype: int64

Nothing looks too abnormal.
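
A slightly stronger sanity check (again, not something I did during the competition) is to compare the predicted proportions against the training label distribution:

# Training labels were shifted to 0..2 earlier, so shift back before comparing
train_dist = inverse_transform_labels(y).value_counts(normalize=True).sort_index()
pred_dist = y_pred_series.value_counts(normalize=True).sort_index()
print(pd.concat([train_dist.rename('train'), pred_dist.rename('pred')], axis=1))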

Possible Improvements

Only after the competition did I remember that this dataset is also class-imbalanced. Handling that explicitly should give better results. Well, it was my first machine learning competition of this kind; I'll remember to do it next time.
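
As a sketch of one direction (untried here): RandomForestClassifier accepts class_weight='balanced', which re-weights classes inversely to their frequency; whether it actually helps would need to be checked on the validation split. This reuses the preprocessor and train/test split defined earlier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Same pipeline as before, but with class weights inversely proportional to
# class frequency, as one way to address the imbalance.
balanced_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))
])
balanced_model.fit(X_train, y_train)
print(balanced_model.score(X_test, y_test))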