Jose A Dianes

Machine Learning & Data Analytics - Computer Science PhD - data.jadianes.com

Spark & Python: MLlib Basic Statistics & Exploratory Data Analysis

Published Jul 03, 2015. Last updated Feb 10, 2017.

Instructions

The tutorials in my Spark & Python series can be read individually, although they tell a more or less linear 'story' when followed in sequence. They all use the same dataset to solve a related set of tasks.

It is not the only way, but a good way to follow these Spark tutorials is to first clone the GitHub repo and then start your own IPython notebook in pySpark mode. For example, if we have a standalone Spark installation running on localhost, with a maximum of 6 GB per node assigned to IPython:

MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.3.1-bin-hadoop2.6/bin/pyspark

Notice that the path to the pyspark command will depend on your specific installation. As a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.

For more Spark options see here. In general, the rule is that an option described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
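As a small illustration of the naming rule just described, here is a minimal sketch. The helper `to_env_name` is hypothetical (it is not part of Spark), and it only shows the renaming convention; it does not guarantee that every Spark property has an environment-variable counterpart.

```python
def to_env_name(spark_property):
    """Turn a dotted Spark property name into the upper-case,
    underscore-separated form used in the command above."""
    return spark_property.replace(".", "_").upper()

print(to_env_name("spark.executor.memory"))  # SPARK_EXECUTOR_MEMORY
```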

Datasets

We will be using datasets from the KDD Cup 1999.

References

The reference book for these and other Spark related topics is Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.

The KDD Cup 1999 competition dataset is described in detail here.

Introduction

So far we have used different map and aggregation functions, on simple and key/value pair RDDs, to get simple statistics that help us understand our datasets. In this tutorial we will introduce Spark's machine learning library, MLlib, through its basic statistics functionality in order to better understand our dataset. We will use the reduced 10-percent KDD Cup 1999 dataset.

Getting the Data and Creating the RDD

As we did in our first notebook, we will use the reduced dataset (10 percent) provided for the KDD Cup 1999, containing nearly half a million network interactions. The file is provided as a Gzip file that we will download locally.

import urllib
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")


data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)
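The download call above uses the Python 2 location of urlretrieve. As a hedged sketch, the snippet below imports it in a way that should work on both Python 2 and Python 3, and wraps it in a hypothetical helper, `download_if_missing`, that skips the download when the file is already on disk. The helper name and its behaviour are assumptions for illustration, not part of the original notebook.

```python
import os

try:
    # Python 3 location
    from urllib.request import urlretrieve
except ImportError:
    # Python 2 location, as used in the rest of this tutorial
    from urllib import urlretrieve

def download_if_missing(url, local_path):
    """Download url to local_path unless the file already exists."""
    if not os.path.exists(local_path):
        urlretrieve(url, local_path)
    return local_path
```

It could then be called with the same URL and file name as above, and the returned path passed to sc.textFile.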

Local Vectors

A local vector is often used as a base type for RDDs in Spark MLlib. A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values.

For dense vectors, MLlib uses either Python lists or the NumPy array type. The latter is recommended, so you can simply pass NumPy arrays around.

For sparse vectors, users can construct a SparseVector object from MLlib or pass SciPy scipy.sparse column vectors if SciPy is available in their environment. The easiest way to create sparse vectors is to use the factory methods implemented in Vectors.
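To make the "two parallel arrays" idea concrete, here is a minimal NumPy-only sketch that expands a sparse representation (size, indices, values) into its dense equivalent. The function `sparse_to_dense` is a hypothetical helper written for this illustration; it is not an MLlib function.

```python
import numpy as np

def sparse_to_dense(size, indices, values):
    """Expand a sparse vector, given as two parallel arrays
    (indices and values), into a dense NumPy array."""
    dense = np.zeros(size)
    dense[list(indices)] = values
    return dense

# A vector of size 5 with non-zero entries at positions 1 and 3
dense = sparse_to_dense(5, [1, 3], [2.0, 7.5])
print(dense.tolist())  # [0.0, 2.0, 0.0, 7.5, 0.0]
```

With MLlib available, the corresponding factory calls would be Vectors.dense([0.0, 2.0, 0.0, 7.5, 0.0]) and Vectors.sparse(5, [1, 3], [2.0, 7.5]) from pyspark.mllib.linalg.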

An RDD of Dense Vectors

Let's represent each network interaction in our dataset as a dense vector. For that we will use the NumPy array type.

import numpy as np

def parse_interaction(line):
    line_split = line.split(",")
    # keep just numeric and logical values
    symbolic_indexes = [1,2,3,41]
    clean_line_split = [item for i,item in enumerate(line_split) if i not in symbolic_indexes]
    return np.array([float(x) for x in clean_line_split])

vector_data = raw_data.map(parse_interaction)
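As a quick local check of the parsing logic above, the function can be tried on a single line without Spark. The record below is synthetic (made up to have the 42 comma-separated fields the parser expects), not a real row from the dataset.

```python
import numpy as np

def parse_interaction(line):
    line_split = line.split(",")
    # positions of the three symbolic columns and the label
    symbolic_indexes = [1, 2, 3, 41]
    clean_line_split = [item for i, item in enumerate(line_split)
                        if i not in symbolic_indexes]
    return np.array([float(x) for x in clean_line_split])

# Synthetic record: duration, three symbolic fields, 37 further
# numeric fields and the label (42 fields in total).
sample_line = ",".join(["12", "tcp", "http", "SF"] + ["1"] * 37 + ["normal."])

vector = parse_interaction(sample_line)
print(len(vector))  # 38
print(vector[0])    # 12.0
```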

Summary Statistics

Spark's MLlib provides column summary statistics for RDD[Vector] through the function colStats available in Statistics. The method returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

from pyspark.mllib.stat import Statistics 
from math import sqrt 

# Compute column summary statistics.
summary = Statistics.colStats(vector_data)

print "Duration Statistics:"
print " Mean: {}".format(round(summary.mean()[0],3))
print " St. deviation: {}".format(round(sqrt(summary.variance()[0]),3))
print " Max value: {}".format(round(summary.max()[0],3))
print " Min value: {}".format(round(summary.min()[0],3))
print " Total value count: {}".format(summary.count())
print " Number of non-zero values: {}".format(summary.numNonzeros()[0])

Duration Statistics:  
Mean: 47.979  
St. deviation: 707.746  
Max value: 58329.0  
Min value: 0.0  
Total value count: 494021  
Number of non-zero values: 12350.0
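To clarify what each of these column statistics means, the following sketch computes the same quantities with plain NumPy on a tiny made-up matrix. It assumes that the variance reported by MLlib is the sample variance (hence ddof=1); treat that as an assumption to check against your Spark version.

```python
import numpy as np

# Toy data: 3 rows (observations) x 2 columns (variables)
rows = np.array([[1.0, 0.0],
                 [3.0, 0.0],
                 [5.0, 2.0]])

column_stats = {
    "mean": rows.mean(axis=0),
    # ddof=1 gives the sample variance (assumed to match MLlib)
    "variance": rows.var(axis=0, ddof=1),
    "max": rows.max(axis=0),
    "min": rows.min(axis=0),
    "numNonzeros": (rows != 0).sum(axis=0),
    "count": rows.shape[0],
}

print(column_stats["mean"][0])      # 3.0
print(column_stats["variance"][0])  # 4.0
```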

Summary Statistics by Label

The interesting part of summary statistics, in our case, comes from being able to obtain them by type of network attack, or 'label', in our dataset. By doing so we will be able to better characterise our dataset's dependent variable in terms of the independent variables' ranges of values.

To do that, we can filter an RDD that contains labels as keys and vectors as values. We just need to adapt our parse_interaction function to return a tuple with both elements.

def parse_interaction_with_key(line):
    line_split = line.split(",")
    # keep just numeric and logical values
    symbolic_indexes = [1,2,3,41]
    clean_line_split = [item for i,item in enumerate(line_split) if i not in symbolic_indexes]
    return (line_split[41], np.array([float(x) for x in clean_line_split]))

label_vector_data = raw_data.map(parse_interaction_with_key)

The next step is not very sophisticated. We use filter on the RDD to leave out every label except the one we want to gather statistics for.

normal_label_data = label_vector_data.filter(lambda x: x[0]=="normal.")

Now we can use the new RDD to call colStats on the values.

normal_summary = Statistics.colStats(normal_label_data.values())

And collect the results as we did before.

print "Duration Statistics for label: {}".format("normal")
print " Mean: {}".format(round(normal_summary.mean()[0],3))
print " St. deviation: {}".format(round(sqrt(normal_summary.variance()[0]),3))
print " Max value: {}".format(round(normal_summary.max()[0],3))
print " Min value: {}".format(round(normal_summary.min()[0],3))
print " Total value count: {}".format(normal_summary.count())
print " Number of non-zero values: {}".format(normal_summary.numNonzeros()[0])

Duration Statistics for label: normal  
Mean: 216.657  
St. deviation: 1359.213  
Max value: 58329.0  
Min value: 0.0  
Total value count: 97278  
Number of non-zero values: 11690.0

Instead of working with key/value pairs, we could have just filtered our raw data using the label in column 41 and then parsed the results as we did before. This would work as well. However, having our data organised as key/value pairs opens the door to better manipulations. Since values() is a transformation on an RDD, and not an action, we don't perform any computation until we call colStats anyway.

But let's wrap this within a function so we can reuse it with any label.

def summary_by_label(raw_data, label):
    label_vector_data = raw_data.map(parse_interaction_with_key).filter(lambda x: x[0]==label)
    return Statistics.colStats(label_vector_data.values())

Let's give it a try with the "normal." label again.

normal_sum = summary_by_label(raw_data, "normal.")

print "Duration Statistics for label: {}".format("normal")
print " Mean: {}".format(round(normal_sum.mean()[0],3))
print " St. deviation: {}".format(round(sqrt(normal_sum.variance()[0]),3))
print " Max value: {}".format(round(normal_sum.max()[0],3))
print " Min value: {}".format(round(normal_sum.min()[0],3))
print " Total value count: {}".format(normal_sum.count())
print " Number of non-zero values: {}".format(normal_sum.numNonzeros()[0])

Duration Statistics for label: normal  
Mean: 216.657  
St. deviation: 1359.213  
Max value: 58329.0  
Min value: 0.0  
Total value count: 97278  
Number of non-zero values: 11690.0

Let's try now with a network attack. We have all of them listed here.

guess_passwd_summary = summary_by_label(raw_data, "guess_passwd.")

print "Duration Statistics for label: {}".format("guess_password")
print " Mean: {}".format(round(guess_passwd_summary.mean()[0],3))
print " St. deviation: {}".format(round(sqrt(guess_passwd_summary.variance()[0]),3))
print " Max value: {}".format(round(guess_passwd_summary.max()[0],3))
print " Min value: {}".format(round(guess_passwd_summary.min()[0],3))
print " Total value count: {}".format(guess_passwd_summary.count())
print " Number of non-zero values: {}".format(guess_passwd_summary.numNonzeros()[0])

Duration Statistics for label: guess_password  
Mean: 2.717  
St. deviation: 11.88  
Max value: 60.0  
Min value: 0.0  
Total value count: 53  
Number of non-zero values: 4.0

We can see that this type of attack is, on average, much shorter in duration than a normal interaction (a mean of about 2.7 versus about 216.7). We could build a table with duration statistics for each type of interaction in our dataset. First we need a list of labels, as described in the first line here.

label_list = ["back.","buffer_overflow.","ftp_write.",
              "guess_passwd.","imap.","ipsweep.",
              "land.","loadmodule.","multihop.",
              "neptune.","nmap.","normal.","perl.",
              "phf.","pod.","portsweep.",
              "rootkit.","satan.","smurf.","spy.",
              "teardrop.","warezclient.",
              "warezmaster."]

Then we get a list of statistics for each label.

stats_by_label = [(label, summary_by_label(raw_data, label)) for label in label_list]

Now we get the statistics for the duration column, the first in our dataset (i.e. index 0).

duration_by_label = [ 
    (stat[0], 
     np.array([
         float(stat[1].mean()[0]), 
         float(sqrt(stat[1].variance()[0])), 
         float(stat[1].min()[0]), 
         float(stat[1].max()[0]), 
         int(stat[1].count())])) 
    for stat in stats_by_label]

We can put that into a Pandas data frame.

import pandas as pd
pd.set_option('display.max_columns', 50)

stats_by_label_df = pd.DataFrame.from_items(duration_by_label, columns=["Mean", "Std Dev", "Min", "Max", "Count"], orient='index')

And print it.

print "Duration statistics, by label"

stats_by_label_df

Duration statistics, by label

Label	Mean	Std Dev	Min	Max	Count
back.	0.128915	1.110062	0	14	2203
buffer_overflow.	91.700000	97.514685	0	321	30
ftp_write.	32.375000	47.449033	0	134	8
guess_passwd.	2.716981	11.879811	0	60	53
imap.	6.000000	14.174240	0	41	12
ipsweep.	0.034483	0.438439	0	7	1247
land.	0.000000	0.000000	0	0	21
loadmodule.	36.222222	41.408869	0	103	9
multihop.	184.000000	253.851006	0	718	7
neptune.	0.000000	0.000000	0	0	107201
nmap.	0.000000	0.000000	0	0	231
normal.	216.657322	1359.213469	0	58329	97278
perl.	41.333333	14.843629	25	54	3
phf.	4.500000	5.744563	0	12	4
pod.	0.000000	0.000000	0	0	264
portsweep.	1915.299038	7285.125159	0	42448	1040
rootkit.	100.800000	216.185003	0	708	10
satan.	0.040277	0.522433	0	11	1589
smurf.	0.000000	0.000000	0	0	280790
spy.	318.000000	26.870058	299	337	2
teardrop.	0.000000	0.000000	0	0	979
warezclient.	615.257843	2207.694966	0	15168	1020
warezmaster.	15.050000	33.385271	0	156	20
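The DataFrame.from_items call used above is no longer available in newer pandas releases. As a hedged sketch of one possible replacement, DataFrame.from_dict with orient='index' builds the same kind of table from (label, values) pairs. The items below are synthetic stand-ins for duration_by_label, used only to keep the example self-contained.

```python
from collections import OrderedDict

import pandas as pd

# Synthetic (label, [mean, std dev, min, max, count]) items
items = [("a.", [1.0, 0.5, 0.0, 2.0, 10]),
         ("b.", [3.0, 1.5, 1.0, 6.0, 20])]

# OrderedDict keeps the labels in their original order
stats_df = pd.DataFrame.from_dict(
    OrderedDict(items),
    orient="index",
    columns=["Mean", "Std Dev", "Min", "Max", "Count"])

print(stats_df.loc["b.", "Max"])  # 6.0
```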

In order to reuse this code and get a dataframe from any variable in our dataset we will define a function.

def get_variable_stats_df(stats_by_label, column_i):
    column_stats_by_label = [
        (stat[0], 
         np.array([
             float(stat[1].mean()[column_i]), 
             float(sqrt(stat[1].variance()[column_i])), 
             float(stat[1].min()[column_i]), 
             float(stat[1].max()[column_i]), 
             int(stat[1].count())])) 
        for stat in stats_by_label
    ]
    return pd.DataFrame.from_items(
        column_stats_by_label, 
        columns=["Mean", "Std Dev", "Min", "Max", "Count"], 
        orient='index')

Let's try for duration again.

get_variable_stats_df(stats_by_label,0)

Label	Mean	Std Dev	Min	Max	Count
back.	0.128915	1.110062	0	14	2203
buffer_overflow.	91.700000	97.514685	0	321	30
ftp_write.	32.375000	47.449033	0	134	8
guess_passwd.	2.716981	11.879811	0	60	53
imap.	6.000000	14.174240	0	41	12
ipsweep.	0.034483	0.438439	0	7	1247
land.	0.000000	0.000000	0	0	21
loadmodule.	36.222222	41.408869	0	103	9
multihop.	184.000000	253.851006	0	718	7
neptune.	0.000000	0.000000	0	0	107201
nmap.	0.000000	0.000000	0	0	231
normal.	216.657322	1359.213469	0	58329	97278
perl.	41.333333	14.843629	25	54	3
phf.	4.500000	5.744563	0	12	4
pod.	0.000000	0.000000	0	0	264
portsweep.	1915.299038	7285.125159	0	42448	1040
rootkit.	100.800000	216.185003	0	708	10
satan.	0.040277	0.522433	0	11	1589
smurf.	0.000000	0.000000	0	0	280790
spy.	318.000000	26.870058	299	337	2
teardrop.	0.000000	0.000000	0	0	979
warezclient.	615.257843	2207.694966	0	15168	1020
warezmaster.	15.050000	33.385271	0	156	20

Now for the next numeric column in the dataset, src_bytes.

print "src_bytes statistics, by label"
get_variable_stats_df(stats_by_label,1)

src_bytes statistics, by label

Label	Mean	Std Dev	Min	Max	Count
back.	54156.355878	3159.360232	13140	54540	2203
buffer_overflow.	1400.433333	1337.132616	0	6274	30
ftp_write.	220.750000	267.747616	0	676	8
guess_passwd.	125.339623	3.037860	104	126	53
imap.	347.583333	629.926036	0	1492	12
ipsweep.	10.083400	5.231658	0	18	1247
land.	0.000000	0.000000	0	0	21
loadmodule.	151.888889	127.745298	0	302	9
multihop.	435.142857	540.960389	0	1412	7
neptune.	0.000000	0.000000	0	0	107201
nmap.	24.116883	59.419871	0	207	231
normal.	1157.047524	34226.124718	0	2194619	97278
perl.	265.666667	4.932883	260	269	3
phf.	51.000000	0.000000	51	51	4
pod.	1462.651515	125.098044	564	1480	264
portsweep.	666707.436538	21500665.866700	0	693375640	1040
rootkit.	294.700000	538.578180	0	1727	10
satan.	1.337319	42.946200	0	1710	1589
smurf.	935.772300	200.022386	520	1032	280790
spy.	174.500000	88.388348	112	237	2
teardrop.	28.000000	0.000000	28	28	979
warezclient.	300219.562745	1200905.243130	30	5135678	1020
warezmaster.	49.300000	212.155132	0	950	20

And so on. By reusing the summary_by_label and get_variable_stats_df functions we can perform some exploratory data analysis on large datasets with Spark.
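One caveat: as written, stats_by_label calls summary_by_label once per label, so each call appears to trigger its own pass over the data. As an alternative sketch (not the approach used in this tutorial), per-label count, mean and standard deviation can be accumulated in a single pass with Welford's online algorithm. The example is plain Python over synthetic (label, value) pairs; the function `stats_by_key` is hypothetical, and how best to map it onto Spark is left open here.

```python
from math import sqrt

def stats_by_key(pairs):
    """One pass over (label, value) pairs using Welford's algorithm.
    Returns {label: (count, mean, sample_std_dev)}."""
    acc = {}  # label -> [count, mean, sum of squared deviations]
    for label, value in pairs:
        state = acc.setdefault(label, [0, 0.0, 0.0])
        state[0] += 1
        delta = value - state[1]
        state[1] += delta / state[0]
        state[2] += delta * (value - state[1])
    result = {}
    for label, (n, mean, m2) in acc.items():
        std = sqrt(m2 / (n - 1)) if n > 1 else 0.0
        result[label] = (n, mean, std)
    return result

# Synthetic (label, duration) pairs, for illustration only
pairs = [("normal.", 1.0), ("smurf.", 0.0), ("normal.", 3.0),
         ("normal.", 5.0), ("smurf.", 0.0)]
summary = stats_by_key(pairs)
print(summary["normal."])  # (3, 3.0, 2.0)
```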

Correlations

Spark's MLlib supports Pearson's and Spearman's methods to calculate pairwise correlations among many series. Both of them are provided by the corr method in the Statistics package.

We have two options as input: either two RDD[Double]s or an RDD[Vector]. In the first case the output will be a Double value, while in the second it will be a whole correlation Matrix. Due to the nature of our data, we will obtain the second.

from pyspark.mllib.stat import Statistics 
correlation_matrix = Statistics.corr(vector_data, method="spearman")
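To illustrate how the two methods differ, here is a minimal NumPy sketch on made-up data. It relies on the fact that Spearman's coefficient is Pearson's coefficient computed on ranks. The helper functions are hypothetical, and the ranking shortcut assumes there are no tied values, which would need average ranks in a full implementation.

```python
import numpy as np

def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    # Spearman's coefficient is Pearson's coefficient of the ranks
    return pearson(ranks(x), ranks(y))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # monotonic but non-linear in x

print(round(spearman(x, y), 6))  # 1.0
print(pearson(x, y) < 1.0)       # True
```

Because y always grows with x, the rank-based coefficient is 1, while the linear (Pearson) coefficient stays below 1.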

Once we have the correlations ready, we can start inspecting their values.

import pandas as pd
pd.set_option('display.max_columns', 50)

col_names = ["duration","src_bytes","dst_bytes",
             "land","wrong_fragment",
             "urgent","hot","num_failed_logins",
             "logged_in","num_compromised",
             "root_shell","su_attempted",
             "num_root","num_file_creations",
             "num_shells","num_access_files",
             "num_outbound_cmds",
             "is_hot_login","is_guest_login","count",
             "srv_count","serror_rate",
             "srv_serror_rate","rerror_rate",
             "srv_rerror_rate","same_srv_rate",
             "diff_srv_rate","srv_diff_host_rate",
             "dst_host_count","dst_host_srv_count",
             "dst_host_same_srv_rate","dst_host_diff_srv_rate",
             "dst_host_same_src_port_rate",
             "dst_host_srv_diff_host_rate","dst_host_serror_rate",
             "dst_host_srv_serror_rate",
             "dst_host_rerror_rate","dst_host_srv_rerror_rate"]

corr_df = pd.DataFrame(
                    correlation_matrix, 
                    index=col_names, 
                    columns=col_names)

corr_df

.	duration	src_bytes	dst_bytes	land	wrong_fragment	urgent	hot	num_failed_logins	logged_in	num_compromised	root_shell	su_attempted	num_root	num_file_creations	num_shells	num_access_files	num_outbound_cmds	is_hot_login	is_guest_login	count	srv_count	serror_rate	srv_serror_rate	rerror_rate	srv_rerror_rate	same_srv_rate	diff_srv_rate	srv_diff_host_rate	dst_host_count	dst_host_srv_count	dst_host_same_srv_rate	dst_host_diff_srv_rate	dst_host_same_src_port_rate	dst_host_srv_diff_host_rate	dst_host_serror_rate	dst_host_srv_serror_rate	dst_host_rerror_rate	dst_host_srv_rerror_rate
duration	1.000000	0.014196	0.299189	-0.001068	-0.008025	0.017883	0.108639	0.014363	0.159564	0.010687	0.040425	0.026015	0.013401	0.061099	0.008632	0.019407	-0.000019	-0.000010	0.205606	-0.259032	-0.250139	-0.074211	-0.073663	-0.025936	-0.026420	0.062291	-0.050875	0.123621	-0.161107	-0.217167	-0.211979	0.231644	-0.065202	0.100692	-0.056753	-0.057298	-0.007759	-0.013891
src_bytes	0.014196	1.000000	-0.167931	-0.009404	-0.019358	0.000094	0.113920	-0.008396	-0.089702	0.118562	0.003067	0.002282	-0.002050	0.027710	0.014403	-0.001497	0.000010	0.000019	0.027511	0.666230	0.722609	-0.657460	-0.652391	-0.342180	-0.332977	0.744046	-0.739988	-0.104042	0.130377	0.741979	0.729151	-0.712965	0.815039	-0.140231	-0.645920	-0.641792	-0.297338	-0.300581
dst_bytes	0.299189	-0.167931	1.000000	-0.003040	-0.022659	0.007234	0.193156	0.021952	0.882185	0.169772	0.026054	0.012192	-0.003884	0.034154	-0.000054	0.065776	-0.000031	0.000041	0.085947	-0.639157	-0.497683	-0.205848	-0.198715	-0.100958	-0.081307	0.229677	-0.222572	0.521003	-0.611972	0.024124	0.055033	-0.035073	-0.396195	0.578557	-0.167047	-0.158378	-0.003042	0.001621
land	-0.001068	-0.009404	-0.003040	1.000000	-0.000333	-0.000065	-0.000539	-0.000076	-0.002785	-0.000447	-0.000093	-0.000049	-0.000230	-0.000150	-0.000076	-0.000211	-0.002881	0.002089	-0.000250	-0.010939	-0.010128	0.014160	0.014342	-0.000451	-0.001690	0.002153	-0.001846	0.020678	-0.019923	-0.012341	0.002576	-0.001803	0.004265	0.016171	0.013566	0.012265	0.000389	-0.001816
wrong_fragment	-0.008025	-0.019358	-0.022659	-0.000333	1.000000	-0.000150	-0.004042	-0.000568	-0.020911	-0.003370	-0.000528	-0.000248	-0.001727	-0.001160	-0.000507	-0.001519	-0.000147	0.000441	-0.001869	-0.057711	-0.029117	-0.008849	-0.023382	0.000430	-0.012676	0.010218	-0.009386	0.012117	-0.029149	-0.058225	-0.049560	0.055542	-0.015449	0.007306	0.010387	-0.024117	0.046656	-0.013666
urgent	0.017883	0.000094	0.007234	-0.000065	-0.000150	1.000000	0.008594	0.063009	0.006821	0.031765	0.067437	0.000020	0.061994	0.061383	-0.000066	0.023380	0.012879	0.005162	-0.000100	-0.004778	-0.004799	-0.001338	-0.001327	-0.000705	-0.000726	0.001521	-0.001522	-0.000788	-0.005894	-0.005698	-0.004078	0.005208	-0.001939	-0.000976	-0.001381	-0.001370	-0.000786	-0.000782
hot	0.108639	0.113920	0.193156	-0.000539	-0.004042	0.008594	1.000000	0.112560	0.189126	0.811529	0.101983	-0.000400	0.003096	0.028694	0.009146	0.004224	-0.000393	-0.000248	0.463706	-0.120847	-0.114735	-0.035487	-0.034934	0.013468	0.052003	0.041342	-0.040555	0.032141	-0.074178	-0.017960	0.018783	-0.017198	-0.086998	-0.014141	-0.004706	-0.010721	0.199019	0.189142
num_failed_logins	0.014363	-0.008396	0.021952	-0.000076	-0.000568	0.063009	0.112560	1.000000	-0.002190	0.004619	0.016895	0.072748	0.010060	0.015211	-0.000093	0.005581	0.003431	-0.001560	-0.000428	-0.018024	-0.018027	-0.003674	-0.004027	0.035324	0.034876	0.005716	-0.005538	-0.003096	-0.028369	-0.015092	0.003004	-0.002960	-0.006617	-0.002588	0.014713	0.014914	0.032395	0.032151
logged_in	0.159564	-0.089702	0.882185	-0.002785	-0.020911	0.006821	0.189126	-0.002190	1.000000	0.161190	0.025293	0.011813	0.082533	0.055530	0.024354	0.072698	0.000079	0.000127	0.089318	-0.578287	-0.438947	-0.187114	-0.180122	-0.091962	-0.072287	0.216969	-0.214019	0.503807	-0.682721	0.080352	0.114526	-0.093565	-0.359506	0.659078	-0.143283	-0.132474	0.007236	0.012979
num_compromised	0.010687	0.118562	0.169772	-0.000447	-0.003370	0.031765	0.811529	0.004619	0.161190	1.000000	0.085558	0.048985	0.028557	0.031223	0.011256	0.006977	0.001048	-0.000438	-0.002504	-0.097212	-0.091154	-0.030516	-0.030264	0.008573	0.054006	0.035253	-0.034953	0.036497	-0.041615	0.003465	0.038980	-0.039091	-0.078843	-0.020979	-0.005019	-0.004504	0.214115	0.217858
root_shell	0.040425	0.003067	0.026054	-0.000093	-0.000528	0.067437	0.101983	0.016895	0.025293	0.085558	1.000000	0.233486	0.094512	0.140650	0.132056	0.069353	0.011462	-0.006602	-0.000405	-0.016409	-0.015174	-0.004952	-0.004923	-0.001104	-0.001143	0.004946	-0.004553	0.002286	-0.021367	-0.011906	0.000515	-0.000916	-0.004617	0.008631	-0.003498	-0.003032	0.002763	0.002151
su_attempted	0.026015	0.002282	0.012192	-0.000049	-0.000248	0.000020	-0.000400	0.072748	0.011813	0.048985	0.233486	1.000000	0.119326	0.053110	0.040487	0.081272	-0.018896	0.012927	-0.000219	-0.008279	-0.008225	-0.002318	-0.002295	-0.001227	-0.001253	0.002634	-0.002649	0.000348	-0.006697	-0.006288	-0.005738	0.006687	-0.005020	0.001052	0.001974	0.002893	0.003173	0.001731
num_root	0.013401	-0.002050	-0.003884	-0.000230	-0.001727	0.061994	0.003096	0.010060	0.082533	0.028557	0.094512	0.119326	1.000000	0.047521	0.034405	0.014513	0.001524	-0.002585	-0.001281	-0.054721	-0.053530	-0.016031	-0.015936	-0.008610	-0.008708	0.013881	-0.011337	0.006316	-0.078717	-0.038689	-0.038935	0.047414	-0.015968	0.061030	-0.008457	-0.007096	-0.000421	-0.005012
num_file_creations	0.061099	0.027710	0.034154	-0.000150	-0.001160	0.061383	0.028694	0.015211	0.055530	0.031223	0.140650	0.053110	0.047521	1.000000	0.068660	0.031042	-0.004081	-0.001664	0.013242	-0.036467	-0.034598	-0.009703	-0.010390	-0.005069	-0.004775	0.009784	-0.008711	0.014412	-0.049529	-0.026890	-0.021731	0.027092	-0.015018	0.030590	-0.002257	-0.004295	0.000626	-0.001096
num_shells	0.008632	0.014403	-0.000054	-0.000076	-0.000507	-0.000066	0.009146	-0.000093	0.024354	0.011256	0.132056	0.040487	0.034405	0.068660	1.000000	0.019438	-0.002592	-0.006631	-0.000405	-0.013938	-0.011784	-0.004343	-0.004740	-0.002541	-0.002572	0.004282	-0.003743	0.001096	-0.021200	-0.012017	-0.009962	0.010761	-0.003521	0.015882	-0.001588	-0.002357	-0.000617	-0.002020
num_access_files	0.019407	-0.001497	0.065776	-0.000211	-0.001519	0.023380	0.004224	0.005581	0.072698	0.006977	0.069353	0.081272	0.014513	0.031042	0.019438	1.000000	-0.001597	-0.002850	0.002466	-0.045282	-0.040497	-0.013945	-0.013572	-0.007581	0.001874	0.015499	-0.015112	0.024266	-0.023865	-0.023657	-0.021358	0.026703	-0.033288	0.011765	-0.011197	-0.011487	-0.004743	-0.004552
num_outbound_cmds	-0.000019	0.000010	-0.000031	-0.002881	-0.000147	0.012879	-0.000393	0.003431	0.000079	0.001048	0.011462	-0.018896	0.001524	-0.004081	-0.002592	-0.001597	1.000000	0.822890	0.000924	-0.000076	0.000100	0.000167	0.000209	0.000536	0.000346	0.000208	0.000328	-0.000141	-0.000424	-0.000280	-0.000503	-0.000181	-0.000455	0.000288	-0.000011	-0.000372	-0.000823	-0.001038
is_hot_login	-0.000010	0.000019	0.000041	0.002089	0.000441	0.005162	-0.000248	-0.001560	0.000127	-0.000438	-0.006602	0.012927	-0.002585	-0.001664	-0.006631	-0.002850	0.822890	1.000000	0.001512	0.000036	0.000064	0.000102	-0.000302	-0.000550	0.000457	-0.000159	-0.000235	-0.000360	-0.000106	0.000206	0.000229	-0.000004	0.000283	0.000538	-0.000076	-0.000007	-0.000435	-0.000529
is_guest_login	0.205606	0.027511	0.085947	-0.000250	-0.001869	-0.000100	0.463706	-0.000428	0.089318	-0.002504	-0.000405	-0.000219	-0.001281	0.013242	-0.000405	0.002466	0.000924	0.001512	1.000000	-0.062340	-0.062713	-0.017343	-0.017240	-0.008867	-0.009193	0.018042	-0.017000	-0.008878	-0.055453	-0.044366	-0.041749	0.044640	-0.038092	-0.012578	-0.001066	-0.016885	0.025282	-0.004292
count	-0.259032	0.666230	-0.639157	-0.010939	-0.057711	-0.004778	-0.120847	-0.018024	-0.578287	-0.097212	-0.016409	-0.008279	-0.054721	-0.036467	-0.013938	-0.045282	-0.000076	0.000036	-0.062340	1.000000	0.950587	-0.303538	-0.308923	-0.213824	-0.221352	0.346718	-0.361737	-0.384010	0.547443	0.586979	0.539698	-0.546869	0.776906	-0.496554	-0.331571	-0.335290	-0.261194	-0.256176
srv_count	-0.250139	0.722609	-0.497683	-0.010128	-0.029117	-0.004799	-0.114735	-0.018027	-0.438947	-0.091154	-0.015174	-0.008225	-0.053530	-0.034598	-0.011784	-0.040497	0.000100	0.000064	-0.062713	0.950587	1.000000	-0.428185	-0.421424	-0.281468	-0.284034	0.517227	-0.511998	-0.239057	0.442611	0.720746	0.681955	-0.673916	0.812280	-0.391712	-0.449096	-0.442823	-0.313442	-0.308132
serror_rate	-0.074211	-0.657460	-0.205848	0.014160	-0.008849	-0.001338	-0.035487	-0.003674	-0.187114	-0.030516	-0.004952	-0.002318	-0.016031	-0.009703	-0.004343	-0.013945	0.000167	0.000102	-0.017343	-0.303538	-0.428185	1.000000	0.990888	-0.091157	-0.095285	-0.851915	0.828012	-0.121489	0.165350	-0.724317	-0.745745	0.719708	-0.650336	-0.153568	0.973947	0.965663	-0.103198	-0.105434
srv_serror_rate	-0.073663	-0.652391	-0.198715	0.014342	-0.023382	-0.001327	-0.034934	-0.004027	-0.180122	-0.030264	-0.004923	-0.002295	-0.015936	-0.010390	-0.004740	-0.013572	0.000209	-0.000302	-0.017240	-0.308923	-0.421424	0.990888	1.000000	-0.110664	-0.115286	-0.839315	0.815305	-0.112222	0.160322	-0.713313	-0.734334	0.707753	-0.646256	-0.148072	0.967214	0.970617	-0.122630	-0.124656
rerror_rate	-0.025936	-0.342180	-0.100958	-0.000451	0.000430	-0.000705	0.013468	0.035324	-0.091962	0.008573	-0.001104	-0.001227	-0.008610	-0.005069	-0.002541	-0.007581	0.000536	-0.000550	-0.008867	-0.213824	-0.281468	-0.091157	-0.110664	1.000000	0.978813	-0.327986	0.345571	-0.017902	-0.067857	-0.330391	-0.303126	0.308722	-0.278465	0.073061	-0.094076	-0.110646	0.910225	0.911622
srv_rerror_rate	-0.026420	-0.332977	-0.081307	-0.001690	-0.012676	-0.000726	0.052003	0.034876	-0.072287	0.054006	-0.001143	-0.001253	-0.008708	-0.004775	-0.002572	0.001874	0.000346	0.000457	-0.009193	-0.221352	-0.284034	-0.095285	-0.115286	0.978813	1.000000	-0.316568	0.333439	0.011285	-0.072595	-0.323032	-0.294328	0.300186	-0.282239	0.075178	-0.096146	-0.114341	0.904591	0.914904
same_srv_rate	0.062291	0.744046	0.229677	0.002153	0.010218	0.001521	0.041342	0.005716	0.216969	0.035253	0.004946	0.002634	0.013881	0.009784	0.004282	0.015499	0.000208	-0.000159	0.018042	0.346718	0.517227	-0.851915	-0.839315	-0.327986	-0.316568	1.000000	-0.982109	0.140660	-0.190121	0.848754	0.873551	-0.844537	0.732841	0.179040	-0.830067	-0.819335	-0.282487	-0.282913
diff_srv_rate	-0.050875	-0.739988	-0.222572	-0.001846	-0.009386	-0.001522	-0.040555	-0.005538	-0.214019	-0.034953	-0.004553	-0.002649	-0.011337	-0.008711	-0.003743	-0.015112	0.000328	-0.000235	-0.017000	-0.361737	-0.511998	0.828012	0.815305	0.345571	0.333439	-0.982109	1.000000	-0.138293	0.185942	-0.844028	-0.868580	0.850911	-0.727031	-0.176930	0.807205	0.795844	0.299041	0.298904
srv_diff_host_rate	0.123621	-0.104042	0.521003	0.020678	0.012117	-0.000788	0.032141	-0.003096	0.503807	0.036497	0.002286	0.000348	0.006316	0.014412	0.001096	0.024266	-0.000141	-0.000360	-0.008878	-0.384010	-0.239057	-0.121489	-0.112222	-0.017902	0.011285	0.140660	-0.138293	1.000000	-0.445051	0.035010	0.068648	-0.050472	-0.222707	0.433173	-0.097973	-0.092661	0.022585	0.024722
dst_host_count	-0.161107	0.130377	-0.611972	-0.019923	-0.029149	-0.005894	-0.074178	-0.028369	-0.682721	-0.041615	-0.021367	-0.006697	-0.078717	-0.049529	-0.021200	-0.023865	-0.000424	-0.000106	-0.055453	0.547443	0.442611	0.165350	0.160322	-0.067857	-0.072595	-0.190121	0.185942	-0.445051	1.000000	0.022731	-0.070448	0.044338	0.189876	-0.918894	0.123881	0.113845	-0.125142	-0.125273
dst_host_srv_count	-0.217167	0.741979	0.024124	-0.012341	-0.058225	-0.005698	-0.017960	-0.015092	0.080352	0.003465	-0.011906	-0.006288	-0.038689	-0.026890	-0.012017	-0.023657	-0.000280	0.000206	-0.044366	0.586979	0.720746	-0.724317	-0.713313	-0.330391	-0.323032	0.848754	-0.844028	0.035010	0.022731	1.000000	0.970072	-0.955178	0.769481	0.043668	-0.722607	-0.708392	-0.312040	-0.300787
dst_host_same_srv_rate	-0.211979	0.729151	0.055033	0.002576	-0.049560	-0.004078	0.018783	0.003004	0.114526	0.038980	0.000515	-0.005738	-0.038935	-0.021731	-0.009962	-0.021358	-0.000503	0.000229	-0.041749	0.539698	0.681955	-0.745745	-0.734334	-0.303126	-0.294328	0.873551	-0.868580	0.068648	-0.070448	0.970072	1.000000	-0.980245	0.771158	0.107926	-0.742045	-0.725272	-0.278068	-0.264383
dst_host_diff_srv_rate	0.231644	-0.712965	-0.035073	-0.001803	0.055542	0.005208	-0.017198	-0.002960	-0.093565	-0.039091	-0.000916	0.006687	0.047414	0.027092	0.010761	0.026703	-0.000181	-0.000004	0.044640	-0.546869	-0.673916	0.719708	0.707753	0.308722	0.300186	-0.844537	0.850911	-0.050472	0.044338	-0.955178	-0.980245	1.000000	-0.766402	-0.088665	0.719275	0.701149	0.287476	0.271067
dst_host_same_src_port_rate	-0.065202	0.815039	-0.396195	0.004265	-0.015449	-0.001939	-0.086998	-0.006617	-0.359506	-0.078843	-0.004617	-0.005020	-0.015968	-0.015018	-0.003521	-0.033288	-0.000455	0.000283	-0.038092	0.776906	0.812280	-0.650336	-0.646256	-0.278465	-0.282239	0.732841	-0.727031	-0.222707	0.189876	0.769481	0.771158	-0.766402	1.000000	-0.175310	-0.658737	-0.652636	-0.299273	-0.297100
dst_host_srv_diff_host_rate	0.100692	-0.140231	0.578557	0.016171	0.007306	-0.000976	-0.014141	-0.002588	0.659078	-0.020979	0.008631	0.001052	0.061030	0.030590	0.015882	0.011765	0.000288	0.000538	-0.012578	-0.496554	-0.391712	-0.153568	-0.148072	0.073061	0.075178	0.179040	-0.176930	0.433173	-0.918894	0.043668	0.107926	-0.088665	-0.175310	1.000000	-0.118697	-0.103715	0.114971	0.120767
dst_host_serror_rate	-0.056753	-0.645920	-0.167047	0.013566	0.010387	-0.001381	-0.004706	0.014713	-0.143283	-0.005019	-0.003498	0.001974	-0.008457	-0.002257	-0.001588	-0.011197	-0.000011	-0.000076	-0.001066	-0.331571	-0.449096	0.973947	0.967214	-0.094076	-0.096146	-0.830067	0.807205	-0.097973	0.123881	-0.722607	-0.742045	0.719275	-0.658737	-0.118697	1.000000	0.968015	-0.087531	-0.096899
dst_host_srv_serror_rate	-0.057298	-0.641792	-0.158378	0.012265	-0.024117	-0.001370	-0.010721	0.014914	-0.132474	-0.004504	-0.003032	0.002893	-0.007096	-0.004295	-0.002357	-0.011487	-0.000372	-0.000007	-0.016885	-0.335290	-0.442823	0.965663	0.970617	-0.110646	-0.114341	-0.819335	0.795844	-0.092661	0.113845	-0.708392	-0.725272	0.701149	-0.652636	-0.103715	0.968015	1.000000	-0.111578	-0.110532
dst_host_rerror_rate	-0.007759	-0.297338	-0.003042	0.000389	0.046656	-0.000786	0.199019	0.032395	0.007236	0.214115	0.002763	0.003173	-0.000421	0.000626	-0.000617	-0.004743	-0.000823	-0.000435	0.025282	-0.261194	-0.313442	-0.103198	-0.122630	0.910225	0.904591	-0.282487	0.299041	0.022585	-0.125142	-0.312040	-0.278068	0.287476	-0.299273	0.114971	-0.087531	-0.111578	1.000000	0.950964
dst_host_srv_rerror_rate	-0.013891	-0.300581	0.001621	-0.001816	-0.013666	-0.000782	0.189142	0.032151	0.012979	0.217858	0.002151	0.001731	-0.005012	-0.001096	-0.002020	-0.004552	-0.001038	-0.000529	-0.004292	-0.256176	-0.308132	-0.105434	-0.124656	0.911622	0.914904	-0.282913	0.298904	0.024722	-0.125273	-0.300787	-0.264383	0.271067	-0.297100	0.120767	-0.096899	-0.110532	0.950964	1.000000

We have used a Pandas DataFrame here to render the correlation matrix in a more readable way. Now we want to find those variables that are highly correlated. For that we do a bit of dataframe manipulation.

# get a boolean dataframe where true means that 
# a pair of variables is highly correlated
highly_correlated_df = (abs(corr_df) > .8) & (corr_df < 1.0)

# get the names of the variables so we can use 
# them to slice the dataframe
correlated_vars_index = (highly_correlated_df==True).any()
correlated_var_names = correlated_vars_index[correlated_vars_index==True].index

# slice it
highly_correlated_df.loc[correlated_var_names,correlated_var_names]

.	src_bytes	dst_bytes	hot	logged_in	num_compromised	num_outbound_cmds	is_hot_login	count	srv_count	serror_rate	srv_serror_rate	rerror_rate	srv_rerror_rate	same_srv_rate	diff_srv_rate	dst_host_count	dst_host_srv_count	dst_host_same_srv_rate	dst_host_diff_srv_rate	dst_host_same_src_port_rate	dst_host_srv_diff_host_rate	dst_host_serror_rate	dst_host_srv_serror_rate	dst_host_rerror_rate	dst_host_srv_rerror_rate
src_bytes	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False	False
dst_bytes	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
hot	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
logged_in	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
num_compromised	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
num_outbound_cmds	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
is_hot_login	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
count	False	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
srv_count	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False	False
serror_rate	False	False	False	False	False	False	False	False	False	False	True	False	False	True	True	False	False	False	False	False	False	True	True	False	False
srv_serror_rate	False	False	False	False	False	False	False	False	False	True	False	False	False	True	True	False	False	False	False	False	False	True	True	False	False
rerror_rate	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	True	True
srv_rerror_rate	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	True	True
same_srv_rate	False	False	False	False	False	False	False	False	False	True	True	False	False	False	True	False	True	True	True	False	False	True	True	False	False
diff_srv_rate	False	False	False	False	False	False	False	False	False	True	True	False	False	True	False	False	True	True	True	False	False	True	False	False	False
dst_host_count	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False
dst_host_srv_count	False	False	False	False	False	False	False	False	False	False	False	False	False	True	True	False	False	True	True	False	False	False	False	False	False
dst_host_same_srv_rate	False	False	False	False	False	False	False	False	False	False	False	False	False	True	True	False	True	False	True	False	False	False	False	False	False
dst_host_diff_srv_rate	False	False	False	False	False	False	False	False	False	False	False	False	False	True	True	False	True	True	False	False	False	False	False	False	False
dst_host_same_src_port_rate	True	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
dst_host_srv_diff_host_rate	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False
dst_host_serror_rate	False	False	False	False	False	False	False	False	False	True	True	False	False	True	True	False	False	False	False	False	False	False	True	False	False
dst_host_srv_serror_rate	False	False	False	False	False	False	False	False	False	True	True	False	False	True	False	False	False	False	False	False	False	True	False	False	False
dst_host_rerror_rate	False	False	False	False	False	False	False	False	False	False	False	True	True	False	False	False	False	False	False	False	False	False	False	False	True
dst_host_srv_rerror_rate	False	False	False	False	False	False	False	False	False	False	False	True	True	False	False	False	False	False	False	False	False	False	False	True	False
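
A boolean matrix of this size is hard to scan by eye. As a minimal sketch, assuming a square Pandas correlation dataframe like corr_df, the same information could be flattened into a list of variable pairs. The function name correlated_pairs and the toy matrix toy_corr are hypothetical and its values are invented for illustration. Note that, unlike the filter used above, this sketch excludes only the diagonal, so an off-diagonal correlation of exactly 1.0 would still be listed.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr_df, threshold=0.8):
    """Return (var_a, var_b, correlation) for each strongly correlated pair."""
    # keep only the strict upper triangle, so each pair appears once
    # and the diagonal (self-correlation) is ignored
    upper = np.triu(np.ones(corr_df.shape, dtype=bool), k=1)
    pairs = corr_df.where(upper).stack()
    # NaN entries compare as False here, so they are filtered out too
    strong = pairs[pairs.abs() > threshold]
    return [(a, b, r) for (a, b), r in strong.items()]

# toy correlation matrix, values invented for illustration
toy_corr = pd.DataFrame(
    [[1.0, 0.95, 0.1],
     [0.95, 1.0, -0.85],
     [0.1, -0.85, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"])

for var_a, var_b, r in correlated_pairs(toy_corr):
    print(var_a, var_b, r)
```

With the real matrix, the call would be correlated_pairs(corr_df), which would list each strongly correlated pair together with its coefficient.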

Conclusions

Possible Model Selection Hints

The previous dataframe showed us which variables are highly correlated. We have kept just those variables with at least one strong correlation. We can use this information as we please, but a good way to use it is for model selection. That is, if we have a group of variables that are highly correlated, we can keep just one of them to represent the group, under the assumption that they convey similar information as predictors. Reducing the number of variables will not necessarily improve our model accuracy, but it will make the model easier to understand and more efficient to compute.

For example, from the description of the KDD Cup 99 task we know that the variable dst_host_same_src_port_rate references the percentage of the last 100 connections to the same port, for the same destination host. In our correlation matrix (and auxiliary dataframes) we find that this one is highly and positively correlated with src_bytes and srv_count. The former is the number of bytes sent from source to destination. The latter is the number of connections to the same service as the current connection in the past 2 seconds. We might decide not to include dst_host_same_src_port_rate in our model if we include the other two, as a way to reduce the number of variables and later on better interpret our models.
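
The idea of keeping one variable per correlated group can be sketched as a simple greedy pass over the correlation matrix. This is only one possible strategy, not the method used later in this series. The function name drop_correlated and the toy matrix are hypothetical, and which variable ends up representing a group depends on the column order.

```python
import pandas as pd

def drop_correlated(corr_df, threshold=0.8):
    """Greedily keep variables, skipping any variable that is highly
    correlated with one that has already been kept."""
    kept = []
    for name in corr_df.columns:
        # NaN correlations compare as False, so they never cause a drop
        clashes = any(abs(corr_df.loc[name, other]) > threshold
                      for other in kept)
        if not clashes:
            kept.append(name)
    return kept

# toy correlation matrix, values invented for illustration
toy_corr = pd.DataFrame(
    [[1.0, 0.95, 0.1],
     [0.95, 1.0, -0.85],
     [0.1, -0.85, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"])

print(drop_correlated(toy_corr))  # ['a', 'c']
```

Here b is dropped because it is strongly correlated with a, while c is kept because its only strong correlation is with the already discarded b. In practice we would probably reorder the columns first, so that the variables we find easiest to interpret are the ones that get kept.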

Later on, in those notebooks dedicated to building predictive models, we will make use of this information to build more interpretable models.


2 Replies

Lei Li

10 years ago

Hi,my friends, nice to see you again.

I have a question about the code in the “An RDD of dense vectors”:

clean_line_split = [item for i,item in enumerate(line_split) if i not in symbolic_indexes]

I can’t understand what it means exactly. In particular, I am confused by the for and if code inside the […]. So could you please help? Thank you :)

Lei Li

10 years ago

OK, I figured out what you want to do by myself. Thank you :)