A Toolkit for Machine Learning & Analytics Use Cases.
This is a Splunk app containing macros, algorithms, lookups, and searches that can be leveraged for data science in Splunk. The knowledge objects can be used for feature engineering, or combined with each other to solve use cases in information security, business analytics, IT operations, fraud, etc. The objective is to decrease the level of effort, and avoid duplication of effort, required to create common analytics.
Plotting a linear trendline
sourcetype=my_data | timechart count as yvalue | `lineartrend(_time,yvalue)` | timechart sum(yvalue) sum(newY)
Plotting R2
sourcetype=my_data | timechart count as yvalue | `lineartrend(_time,yvalue)` | stats first(R2)
More info: https://wiki.splunk.com/Community:Plotting_a_linear_trendline
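As a sketch of the math behind a linear trendline and its R2 statistic (ordinary least squares; the function name and variables below are illustrative, not the macro's internals):

```python
# Fit y = slope*x + intercept by ordinary least squares and report R^2,
# the fraction of variance in y explained by the fitted line.
def linear_trend(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    return slope, intercept, r2

# Perfectly linear data: slope 2, intercept 1, R^2 of 1
slope, intercept, r2 = linear_trend([0, 1, 2, 3], [1, 3, 5, 7])
```

An R2 near 1 means the trendline fits the series tightly; near 0 means the line explains little of the variation.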
Arguments: $1: value to predict, $2: confidence interval (eg: 90.0), $3: relative time span (eg: +1d), $4: number of days into the future to predict.
sourcetype=my_data | timechart span=10m avg(count) as avg_count | `forecast5w(avg_count,90.0,+1d,3)`
Arguments: $1: inbound traffic field, $2: outbound traffic field
sourcetype=my_data | `pcr(bytes_in,bytes_out)`
More info: http://qosient.com/argus/presentations/Argus.FloCon.2014.PCR.Presentation.pdf
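The Producer/Consumer Ratio from the linked FloCon paper is a simple normalized difference of traffic volumes; a minimal sketch (the function name is illustrative):

```python
# PCR = (bytes_out - bytes_in) / (bytes_out + bytes_in)
# +1.0 = pure producer (all outbound, exfil-like)
# -1.0 = pure consumer (all inbound, download-like)
def pcr(bytes_in, bytes_out):
    total = bytes_in + bytes_out
    if total == 0:
        return 0.0  # no traffic observed; treat as neutral
    return (bytes_out - bytes_in) / total

print(pcr(100, 100))    # balanced host -> 0.0
print(pcr(0, 5000))     # pure producer -> 1.0
print(pcr(9000, 1000))  # mostly consuming -> -0.8
```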
Arguments: $1: count of success, $2: count of failure
sourcetype=my_data | `error_rate(success,failure)`
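The underlying computation is the share of failures in all attempts; a minimal sketch (illustrative name, not the macro body):

```python
# error rate = failures / (successes + failures)
def error_rate(success, failure):
    total = success + failure
    return failure / total if total else 0.0

print(error_rate(980, 20))  # -> 0.02
```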
Arguments: $1: HTTP status_code field
sourcetype=my_data | `http_error(status_code)`
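Assuming the macro flags HTTP error responses by status class, the check reduces to the 4xx/5xx ranges (a sketch under that assumption; the function name is illustrative):

```python
# 4xx = client errors, 5xx = server errors; anything else is not an error
def is_http_error(status_code):
    code = int(status_code)
    return 400 <= code <= 599

print(is_http_error("200"))  # False
print(is_http_error("404"))  # True
print(is_http_error("503"))  # True
```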
Pearson correlation coefficient for a single pair of fields. Arguments: $1: x, $2: y
sourcetype=my_data | `pearsoncoorelationcoe(x,y)`
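The Pearson coefficient is covariance normalized by both standard deviations; a minimal sketch of the computation (illustrative function name):

```python
import math

# r = cov(x, y) / (stdev(x) * stdev(y)); ranges from -1 to +1
def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated -> 1.0 (up to float rounding)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly anti-correlated -> -1.0 (up to float rounding)
```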
Create relative time fields: day of week, hour of day, minute of hour. Arguments: $1: time
sourcetype=mydata | `relative_time(time)`
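Deriving those three fields from an epoch timestamp is straightforward; a sketch in Python (the output field names are assumptions modeled on Splunk's standard date_* fields, not confirmed macro output):

```python
from datetime import datetime, timezone

WDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

# Break an epoch timestamp into day-of-week / hour / minute features (UTC)
def relative_time_features(epoch_time):
    dt = datetime.fromtimestamp(epoch_time, tz=timezone.utc)
    return {
        "date_wday": WDAYS[dt.weekday()],  # day of week
        "date_hour": dt.hour,              # hour of day (0-23)
        "date_minute": dt.minute,          # minute of hour (0-59)
    }

print(relative_time_features(0))  # 1970-01-01 00:00 UTC was a Thursday
```

Cyclic features like these are often more useful to a model than the raw timestamp.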
Parse a Windows Event ID 4688 process name into new features: xT_driveletter, xT_extension, xT_executiblename, etc. Arguments: $1: process
sourcetype=mydata | `xT(process)`
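A sketch of the path decomposition using the standard library's Windows-path helpers (the output keys mirror the field names above, including the app's spelling of xT_executiblename; the function name is illustrative):

```python
import ntpath  # Windows path semantics, works on any OS

# Split a 4688 process path into drive letter, executable name, and extension
def parse_process(process):
    drive, _ = ntpath.splitdrive(process)     # "C:"
    filename = ntpath.basename(process)       # "cmd.exe"
    name, ext = ntpath.splitext(filename)     # ("cmd", ".exe")
    return {
        "xT_driveletter": drive,
        "xT_executiblename": name,
        "xT_extension": ext.lstrip("."),
    }

print(parse_process(r"C:\Windows\System32\cmd.exe"))
```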
Arguments: $1: subnet mask field (eg: 255.255.255.0), $2: network for subnet/vlan (eg:10.1.1.0)
sourcetype=my_data | `subnet_to_cidr(subnet_mask,network)`
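The conversion from dotted-quad mask to CIDR prefix is handled directly by Python's ipaddress module; a sketch of the equivalent computation (illustrative function name):

```python
import ipaddress

# Combine a network address and dotted-quad mask into CIDR notation
def subnet_to_cidr(subnet_mask, network):
    # strict=False tolerates host bits set in the network argument
    net = ipaddress.IPv4Network(f"{network}/{subnet_mask}", strict=False)
    return str(net)

print(subnet_to_cidr("255.255.255.0", "10.1.1.0"))  # -> 10.1.1.0/24
```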
Arguments: $1: numerical list of dest_ports (eg: 22, 23, 443)
sourcetype=my_data | `port2service(dest_port)`
Arguments: $1: input of country data (eg: "United States", "Russian Federation" | "USA","RUS" | "US", "RU"), $2: type of country data in input (eg: country | iso3 | iso2 )
sourcetype=my_data | `adversaries(country_list,iso3)`
Arguments: $1: parsed domain (eg:amazon.com, google.com)
sourcetype=my_data | `cdn(domain)`
Arguments: $1: parsed domain (eg:amazon.com, google.com)
sourcetype=my_data | `ddns(domain)`
Arguments: $1: cipher identified by IPS/IDS (eg: TLS_ECDH_RSA_WITH_RC4_128_SHA)
sourcetype=my_data | `insecure_ciphers(cipher)`
Generate polynomial features, from scikit-learn. Parameters:
- degree : (default: 2) The degree of polynomial features to generate
- interaction_only : (default: False) if True, produce only interaction features: products of at most degree distinct input features
- include_bias : (default: True) include a bias column of ones (the intercept term)
sourcetype=my_data | fit PolynomialFeatures a b c*
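As a sketch of what polynomial feature expansion generates for one row, in pure Python mirroring scikit-learn's output order (the helper name is illustrative):

```python
from itertools import combinations_with_replacement

# Expand a row of feature values into polynomial terms up to `degree`.
# For interaction_only=True behavior, use itertools.combinations instead
# (no repeated indices, so no squared terms).
def polynomial_features(row, degree=2, include_bias=True):
    out = [1.0] if include_bias else []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for i in combo:
                term *= row[i]
            out.append(term)
    return out

# For [a, b] with degree 2: [1, a, b, a^2, a*b, b^2]
print(polynomial_features([2.0, 3.0]))  # -> [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```

Degree-2 expansion of n features yields 1 + n + n(n+1)/2 columns, so the feature count grows quickly with degree.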
Test features using randomized logistic regression. From scikit-learn: "Randomized Regression works by resampling the train data and computing a LogisticRegression on each resampling. In short, the features selected more often are good features. It is also known as stability selection."
sourcetype=my_data | fit RandomizedLogisticRegression x from y z zz* into randomized_model
| summary randomized_model