
WekaPy v1.3.0

- Resolve data decoding issues.
- Updated to Python 3 standards
- Updated Examples
- Tested with Weka 3.9.1
pull/1/head
Faiz Siddiqui 3 years ago
commit a68a2f99ae
5 changed files with 177 additions and 133 deletions
  1. README.md (14 lines changed)
  2. examples/example1.py (24 lines changed)
  3. examples/load_test_model_example.py (12 lines changed)
  4. examples/train_save_model_example.py (12 lines changed)
  5. wekapy.py (248 lines changed)

14
README.md

@ -1,11 +1,11 @@
WekaPy v1.2.1
WekaPy v1.3.0
=================
A simple Python module to provide a wrapper for some of the basic functionality of the Weka toolkit. The project focuses on the *classification* side of Weka, and does not consider clustering, distributions or any visualisation functions at this stage.
Weka is a machine learning tool that lets you classify data based on a set of its attributes and generate predictions for unseen feature instances.
This module abstracts the use of ARFF files, making Weka much easier to use programmatically in Python.
This module abstracts the use of ARFF files, making Weka much easier to use programmatically in Python.
Please note that this project is in very early stages of development and probably will not work in some cases.
@ -33,7 +33,7 @@ Please note that this project is in very early stages of development and probabl
* Exporting data to ARFF format
* WekaPy can generate ARFF files for your training and/or test data
* Useful on its own if you'd rather use the GUI for making classifications
* Filter data
* Filter data
* WekaPy can be used to filter input data prior to training/testing models
* Can use any of the weka.filters classes to filter specified input data
@ -118,7 +118,7 @@ model.train(training_file = "train.arff")
**2.2 Using the Instance object**
If you would rather carry this out programmatically, then you can instead provide a list of Instance objects.
If you would rather carry this out programmatically, then you can instead provide a list of Instance objects.
An Instance simply contains a list of Features, and can be instantiated as follows:
```python
@ -127,7 +127,7 @@ instance1 = Instance()
feature1 = Feature(name="num_milkshakes",value=46,possible_values="real")
feature2 = Feature(name="is_sunny",value=True,possible_values="{False, True}")
feature3 = Feature(name="boys_in_yard",value=True,possible_values="{False ,True}")
instance1.add_feature(feature1)
instance1.add_feature(feature2)
instance1.add_feature(feature3)
@ -216,7 +216,7 @@ As before, an ARFF file is generated and this is used to test against the model.
You can specify a different model to test against, and thus skip the `train()` step, if you wish. This is useful if you have already used `train()` and saved the model elsewhere, have trained the model using Weka's GUI, are using someone else's model, etc.
* `model_file` (`None` by default)
* Set `model_file = "path/to/model.model"` to test with this model instead.
* Set `model_file = "path/to/model.model"` to test with this model instead.
* Any models trained previously will be discarded by the current `Model` object and replaced by this one.
* `instances`
* Pass a list of Instances to `test()` instead of using the `add_test_instance()` method demonstrated in 3.2.
@ -251,5 +251,3 @@ For each Prediction object, these fields are available:
----------------------
Occasionally it may be necessary to carry out some filtering on input data prior to training/testing a model. For example, it may be necessary to reduce the number of attributes in the input data, or split the data into training/testing instances.

24
examples/example1.py

@ -16,14 +16,14 @@ model = Model(classifier_type = "bayes.BayesNet")
# CREATE TRAINING INSTANCES. LAST FEATURE IS THE PREDICTION OUTCOME
instance1 = Instance()
instance1.add_feature(Feature(name="num_milkshakes",value=46,possible_values="real"))
instance1.add_feature(Feature(name="is_sunny",value=True,possible_values="{False, True}"))
instance1.add_feature(Feature(name="boys_in_yard",value=True,possible_values="{False ,True}"))
instance1.add_features([ Feature(name="num_milkshakes", value=46, possible_values="numeric"),
Feature(name="is_sunny", value=True, possible_values="{False, True}"),
Feature(name="boys_in_yard", value=True, possible_values="{False ,True}") ])
instance2 = Instance()
instance2.add_feature(Feature(name="num_milkshakes",value=2,possible_values="real"))
instance2.add_feature(Feature(name="is_sunny",value=False,possible_values="{False, True}"))
instance2.add_feature(Feature(name="boys_in_yard",value=False,possible_values="{False, True}"))
instance2.add_features([ Feature(name="num_milkshakes", value=2, possible_values="numeric"),
Feature(name="is_sunny", value=False, possible_values="{False, True}"),
Feature(name="boys_in_yard", value=False, possible_values="{False ,True}") ])
model.add_train_instance(instance1)
model.add_train_instance(instance2)
@ -34,14 +34,14 @@ model.train(folds=2)
# CREATE TEST INSTANCES
test_instance1 = Instance()
test_instance1.add_feature(Feature(name="num_milkshakes",value=44,possible_values="real"))
test_instance1.add_feature(Feature(name="is_sunny",value=True,possible_values="{False, True}"))
test_instance1.add_feature(Feature(name="boys_in_yard",value="?",possible_values="{False, True}"))
test_instance1.add_features([ Feature(name="num_milkshakes", value=44, possible_values="numeric"),
Feature(name="is_sunny", value=True, possible_values="{False, True}"),
Feature(name="boys_in_yard", value="?", possible_values="{False ,True}") ])
test_instance2 = Instance()
test_instance2.add_feature(Feature(name="num_milkshakes",value=5,possible_values="real"))
test_instance2.add_feature(Feature(name="is_sunny",value=False,possible_values="{False, True}"))
test_instance2.add_feature(Feature(name="boys_in_yard",value="?",possible_values="{False, True}"))
test_instance2.add_features([ Feature(name="num_milkshakes", value=5, possible_values="numeric"),
Feature(name="is_sunny", value=False, possible_values="{False, True}"),
Feature(name="boys_in_yard", value="?", possible_values="{False ,True}") ])
model.add_test_instance(test_instance1)
model.add_test_instance(test_instance2)

12
examples/load_test_model_example.py

@ -16,14 +16,14 @@ model.load_model("/path/to/model.model")
# CREATE TEST INSTANCES
test_instance1 = Instance()
test_instance1.add_feature(Feature(name="num_milkshakes",value=44,possible_values="real"))
test_instance1.add_feature(Feature(name="is_sunny",value=True,possible_values="{False, True}"))
test_instance1.add_feature(Feature(name="boys_in_yard",value="?",possible_values="{False, True}"))
test_instance1.add_features([ Feature(name="num_milkshakes", value=44, possible_values="numeric"),
Feature(name="is_sunny", value=True, possible_values="{False, True}"),
Feature(name="boys_in_yard", value="?", possible_values="{False ,True}") ])
test_instance2 = Instance()
test_instance2.add_feature(Feature(name="num_milkshakes",value=5,possible_values="real"))
test_instance2.add_feature(Feature(name="is_sunny",value=False,possible_values="{False, True}"))
test_instance2.add_feature(Feature(name="boys_in_yard",value="?",possible_values="{False, True}"))
test_instance2.add_features([ Feature(name="num_milkshakes", value=5, possible_values="numeric"),
Feature(name="is_sunny", value=False, possible_values="{False, True}"),
Feature(name="boys_in_yard", value="?", possible_values="{False ,True}") ])
model.add_test_instance(test_instance1)
model.add_test_instance(test_instance2)

12
examples/train_save_model_example.py

@ -11,14 +11,14 @@ model = Model(classifier_type = "bayes.BayesNet")
# CREATE TRAINING INSTANCES. LAST FEATURE IS THE PREDICTION OUTCOME
instance1 = Instance()
instance1.add_feature(Feature(name="num_milkshakes",value=46,possible_values="real"))
instance1.add_feature(Feature(name="is_sunny",value=True,possible_values="{False, True}"))
instance1.add_feature(Feature(name="boys_in_yard",value=True,possible_values="{False ,True}"))
instance1.add_features([ Feature(name="num_milkshakes", value=46, possible_values="numeric"),
Feature(name="is_sunny", value=True, possible_values="{False, True}"),
Feature(name="boys_in_yard", value=True, possible_values="{False ,True}") ])
instance2 = Instance()
instance2.add_feature(Feature(name="num_milkshakes",value=2,possible_values="real"))
instance2.add_feature(Feature(name="is_sunny",value=False,possible_values="{False, True}"))
instance2.add_feature(Feature(name="boys_in_yard",value=False,possible_values="{False, True}"))
instance2.add_features([ Feature(name="num_milkshakes", value=2, possible_values="numeric"),
Feature(name="is_sunny", value=False, possible_values="{False, True}"),
Feature(name="boys_in_yard", value=False, possible_values="{False ,True}") ])
model.add_train_instance(instance1)
model.add_train_instance(instance2)

248
wekapy.py

@ -1,3 +1,4 @@
# Portions of this software Copyright 2017 Faiz Siddiqui
# Portions of this software Copyright 2013 Will Webberley
# Portions of this software Copyright 2013 Martin Chorley
#
@ -20,19 +21,25 @@ import time
import uuid
import random
def decode_data(data):
return data.decode('utf-8').strip()
def run_process(options):
start_time = time.time()
process = subprocess.Popen(options, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
process_output, process_error = process.communicate()
if "Exception" in process_error:
process_output, process_error = map(decode_data, process.communicate())
if any(word in process_error for word in ["Exception", "Error"]):
for line in process_error.split("\n"):
if "Exception" in line:
raise WekapyException(line.split(' ',1)[1])
if any(word in line for word in ["Exception", "Error"]):
raise WekapyException(line.split(' ', 1)[1])
end_time = time.time()
return process_output, end_time-start_time
return process_output, end_time - start_time
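Review note: in the new `run_process`, `line.split(' ', 1)[1]` assumes that the matching stderr line contains a space. A line consisting of a single token (for example a bare `OutOfMemoryError`) would raise `IndexError` rather than `WekapyException`. Below is a minimal sketch of a more defensive extraction; the helper name `extract_error_message` is hypothetical and not part of this commit.

```python
def extract_error_message(stderr_text, keywords=("Exception", "Error")):
    """Return the message part of the first stderr line that mentions a keyword.

    Hypothetical helper: mirrors the keyword scan in run_process, but falls
    back to the whole line when there is no space to split on.
    """
    for line in stderr_text.split("\n"):
        if any(word in line for word in keywords):
            _head, separator, tail = line.partition(" ")
            # partition() never raises; separator is "" when no space exists.
            return tail if separator else line
    return None
```

The caller could then raise `WekapyException` only when the helper returns something other than `None`.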
# Prediction class
#
#
# Used internally and externally to WekaPy to represent a Prediction made as
# a result of running test data through a trained classifier.
# Each prediction effectively represents the classification of a set of instances.
@ -45,10 +52,10 @@ class Prediction:
self.predicted_value = pred_2
self.error = bool(error)
self.probability = float(prob)
def __str__(self):
return_s = str(self.index)+":\t"
return_s = return_s+"observed: "+str(self.observed_value)+"\tpredicted: "+str(self.predicted_value)+"\tprob: "+str(self.probability)
return return_s
return "{}:\tobserved: {}\tpredicted: {}\tprob: {}".format(str(self.index), str(self.observed_value),
str(self.predicted_value), str(self.probability))
# Feature class
@ -56,38 +63,46 @@ class Prediction:
# Used internally and externally to represent a feature of data.
# Each feature should contain a name and a value (for example, name = 'daylight_hours', value = 10)
# possible_values should be represented by a String type object indicating the possible feature values
# e.g. real, {true, false}, {0,1,2}, {tom, dick, harry}, etc.
# e.g. numeric, <nominal-specification>, string, date [<date-format>] etc.
class Feature:
def __init__(self, name = None, value = None, possible_values=None):
def __init__(self, name=None, value=None, possible_values=None):
self.name = name
self.value = value
self.possible_values = possible_values
# Instance class
#
# Used internally and externally to represent a set of Feature objects.
# Essentially, an Instance object just maintains a list of Features.
class Instance:
def __init__(self, features = None):
def __init__(self, features=None):
self.features = features
if features == None:
if features is None:
self.features = []
def add_feature(self, feature):
if isinstance(feature, Feature):
self.features.append(feature)
self.features.append(feature)
else:
raise WekapyException("Argument 'feature' must be of type Feature.")
def add_features(self, features_list):
for feature in features_list:
if isinstance(feature, Feature):
self.features.append(feature)
else:
raise WekapyException("Argument 'feature' must be of type Feature.")
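Review note: the new `add_features` appends items one at a time, so a list with an invalid entry part-way through leaves the earlier entries already added when the exception is raised. If all-or-nothing behaviour is preferred, one option is to validate first and extend afterwards. The sketch below uses stand-in `Feature` and `Instance` classes and raises `TypeError` in place of `WekapyException`; both are assumptions made to keep the example self-contained.

```python
class Feature:
    """Stand-in for wekapy.Feature, reduced to the fields used in the diff."""

    def __init__(self, name=None, value=None, possible_values=None):
        self.name = name
        self.value = value
        self.possible_values = possible_values


class Instance:
    """Stand-in for wekapy.Instance with an all-or-nothing add_features."""

    def __init__(self, features=None):
        self.features = [] if features is None else features

    def add_features(self, features_list):
        features_list = list(features_list)
        # Validate every item before mutating self.features.
        if not all(isinstance(item, Feature) for item in features_list):
            raise TypeError("Every item in 'features_list' must be of type Feature.")
        self.features.extend(features_list)
```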
# Filter class
#
# Used to filter/preprocess data using one of the weka.filters classes.
# Used to filter/pre-process data using one of the weka.filters classes.
class Filter:
def __init__(self, max_memory=1500, verbose=True):
def __init__(self, max_memory=1500, classpath=None, verbose=False):
if not isinstance(max_memory, int):
raise WekapyException("'max_memory' argument must be of type (int).")
return False
self.classpath = classpath
self.max_memory = max_memory
self.id = uuid.uuid4()
self.verbose = verbose
@ -95,95 +110,113 @@ class Filter:
def filter(self, filter_options=None, input_file_name=None, output_file=None, class_column="last"):
if filter_options is None:
raise WekapyException("A filter type is required")
return False
if input_file_name is None:
raise WekapyException("An input file is needed for filtering")
return False
if output_file is None:
output_file = "%s-filtered.arff" % (input_file_name.rstrip(".arff"))
if self.verbose: print "Filtering input data..."
options = ["java", "-Xmx"+str(self.max_memory)+"M"]
output_file = "{}-filtered.arff".format(str(input_file_name.rstrip(".arff")))
if self.verbose:
print("Filtering input data...")
options = ["java", "-Xmx{}M".format(str(self.max_memory))]
if self.classpath is not None:
options.extend(["-cp", self.classpath])
options.extend(filter_options)
options.extend(["-i", input_file_name, "-o", output_file, "-c", class_column])
process_output, run_time = run_process(options)
if self.verbose: print "Filtering complete (time taken = %.2fs)." % (run_time)
if self.verbose:
print("Filtering complete (time taken = {:.2f}s)".format(run_time))
return output_file
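Review note: `input_file_name.rstrip(".arff")` (carried over from the old code) strips any trailing run of the characters `.`, `a`, `r` and `f`, not the literal suffix. For example, `"data/raff.arff".rstrip(".arff")` yields `"data/"`. A minimal sketch of a suffix-safe alternative using `os.path.splitext` follows; the helper names `strip_arff_suffix` and `filtered_name` are hypothetical.

```python
import os


def strip_arff_suffix(path):
    """Remove a trailing '.arff' extension, leaving other paths untouched."""
    root, extension = os.path.splitext(path)
    return root if extension == ".arff" else path


def filtered_name(input_file_name):
    """Build the default output name in the same pattern as Filter.filter()."""
    return "{}-filtered.arff".format(strip_arff_suffix(input_file_name))
```

The same helper would apply to the `-randomised`, `-training` and `-testing` names built in `split()`.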
def split(self, input_file_name=None, training_percentage=67, randomise=True, seed=None):
if input_file_name is None:
raise WekapyException("An input file is needed for filtering")
return False
if not isinstance(training_percentage, int):
raise WekapyException("'training_percentage' argument must be of type (int).")
return False
if randomise == True and seed is None:
seed = random.randint(0,1000)
if randomise == True:
if self.verbose: print "Randomising data order..."
output_file = "%s-randomised.arff" % (input_file_name.rstrip(".arff"))
process_output, run_time = run_process(["java", "-Xmx"+str(self.max_memory)+"M", "weka.filters.unsupervised.instance.Randomize", "-S", str(seed), "-i", input_file_name, "-o", output_file])
options = ["java", "-Xmx{}M".format(str(self.max_memory))]
if self.classpath is not None:
options.extend(["-cp", self.classpath])
if randomise is True and seed is None:
seed = random.randint(0, 1000)
if randomise is True:
if self.verbose:
print("Randomising data order...")
output_file = "{}-randomised.arff".format(str(input_file_name.rstrip(".arff")))
options.extend(
["weka.filters.unsupervised.instance.Randomize", "-S", str(seed), "-i", input_file_name, "-o",
output_file])
process_output, run_time = run_process(options)
input_file_name = output_file
if self.verbose: print "Randomisation complete (time taken = %.2fs)." % (run_time)
if self.verbose: print "Beginning split...\nCreating training set..."
output_file = "%s-training.arff" % (input_file_name.rstrip(".arff"))
process_output, run_time_training = run_process(["java", "-Xmx"+str(self.max_memory)+"M", "weka.filters.unsupervised.instance.RemovePercentage", "-P", str(training_percentage), "-V", "-i", input_file_name, "-o", output_file])
if self.verbose: print "Creating testing set..."
output_file = "%s-testing.arff" % (input_file_name.rstrip(".arff"))
process_output, run_time_testing = run_process(["java", "-Xmx"+str(self.max_memory)+"M", "weka.filters.unsupervised.instance.RemovePercentage", "-P", str(training_percentage), "-i", input_file_name, "-o", output_file])
if self.verbose: print "Split complete (time taken = %.2fs)." % (run_time_training+run_time_testing)
if self.verbose:
print("Randomisation complete (time taken = {:.2f}s).".format(run_time))
if self.verbose:
print("Beginning split...\nCreating training set...")
output_file = "{}-training.arff".format(str(input_file_name.rstrip(".arff")))
options.extend(
["weka.filters.unsupervised.instance.RemovePercentage", "-P", str(training_percentage), "-V", "-i",
input_file_name, "-o", output_file])
process_output, run_time_training = run_process(options)
if self.verbose:
print("Creating testing set...")
output_file = "{}-testing.arff".format(str(input_file_name.rstrip(".arff")))
options.extend(["weka.filters.unsupervised.instance.RemovePercentage", "-P", str(training_percentage), "-i",
input_file_name, "-o", output_file])
process_output, run_time_testing = run_process(options)
if self.verbose:
print("Split complete (time taken = {:.2f}s).".format(run_time_training + run_time_testing))
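Review note: as written in this hunk, `split()` creates `options` once and then calls `options.extend(...)` before each run, so the second and third commands still carry the filter class and arguments of the earlier ones. One way to avoid that is to build a fresh command list per run. The sketch below only assembles the argument lists and does not invoke Java; the helper names `base_command` and `remove_percentage_command` are hypothetical, while the class name and the `-P`, `-V`, `-i`, `-o` flags are the ones used in the diff.

```python
def base_command(max_memory, classpath=None):
    """Return a new 'java -Xmx...M [-cp ...]' prefix on every call."""
    command = ["java", "-Xmx{}M".format(max_memory)]
    if classpath is not None:
        command.extend(["-cp", classpath])
    return command


def remove_percentage_command(max_memory, classpath, percentage,
                              input_file, output_file, invert):
    """Assemble one RemovePercentage invocation from a fresh prefix."""
    command = base_command(max_memory, classpath)
    command.extend(["weka.filters.unsupervised.instance.RemovePercentage",
                    "-P", str(percentage)])
    if invert:
        # -V is passed for the training set in the diff, not for the testing set.
        command.append("-V")
    command.extend(["-i", input_file, "-o", output_file])
    return command
```

Because each call starts from `base_command(...)`, the training and testing commands stay independent of each other and of any earlier randomisation step.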
# Model class
#
# Used externally, and is the main class for use with this library.
# The Model class should be instantiated as the first stage, from which it can be trained
# The Model class should be instantiated as the first stage, from which it can be trained
# and/or tested.
# Instantiate with a classifier_type (and any optional arguments)
class Model:
def __init__(self, classifier_type = None, max_memory = 1500, verbose = True):
if classifier_type == None or not isinstance(classifier_type, str):
def __init__(self, classifier_type=None, max_memory=1500, classpath=None, verbose=False):
if classifier_type is None or not isinstance(classifier_type, str):
raise WekapyException("A classifier type is required for construction.")
return False
if not isinstance(max_memory, int):
raise WekapyException("'max_memory' argument must be of type (int).")
return False
self.id = uuid.uuid4()
self.model_dir = "wekapy_data/models"
self.arff_dir = "wekapy_data/arff"
self.classpath = classpath
self.classifier = classifier_type
self.max_memory = max_memory
self.training_instances = []
self.testing_instances = []
self.predictions = []
self.time_taken = 0.0
self.verbose = verbose
self.trained = False
self.model_file = None
self.training_file = None
self.test_file = None
if not os.path.exists(self.model_dir):
os.makedirs(self.model_dir)
os.makedirs(self.model_dir)
if not os.path.exists(self.arff_dir):
os.makedirs(self.arff_dir)
os.makedirs(self.arff_dir)
# Generate an ARFF file from a list of instances
def create_ARFF(self, instances, type):
output_arff = open(os.path.join(self.arff_dir, "%s-%s.arff" %(str(self.id), type)), "w")
output_arff.write("@relation "+str(self.id)+"\n")
def create_arff(self, instances, data_type):
output_arff = open(os.path.join(self.arff_dir, "{}-{}.arff".format(str(self.id), data_type)), "w")
output_arff.write("@relation " + str(self.id) + "\n")
for i, instance in enumerate(instances):
if i == 0:
for feature in instance.features:
output_arff.write("\t@attribute "+feature.name+" "+str(feature.possible_values)+"\n")
output_arff.write("\t@attribute " + feature.name + " " + str(feature.possible_values) + "\n")
output_arff.write("\n@data\n")
strToWrite = ""
str_to_write = ""
for j, feature in enumerate(instance.features):
if j == 0:
strToWrite = strToWrite + str(feature.value)
str_to_write = str_to_write + str(feature.value)
else:
strToWrite = strToWrite + "," + str(feature.value)
output_arff.write(strToWrite+"\n")
str_to_write = str_to_write + "," + str(feature.value)
output_arff.write(str_to_write + "\n")
output_arff.close()
if type == "training":
self.training_file = self.arff_dir+"/"+str(self.id)+"-"+type+".arff"
if type == "test":
self.test_file = self.arff_dir+"/"+str(self.id)+"-"+type+".arff"
if data_type == "training":
self.training_file = self.arff_dir + "/" + str(self.id) + "-" + data_type + ".arff"
if data_type == "test":
self.test_file = self.arff_dir + "/" + str(self.id) + "-" + data_type + ".arff"
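Review note: the string building in `create_arff` could be expressed more compactly with `str.join`. The sketch below renders the same layout (relation line, tab-indented `@attribute` lines, blank line, `@data`, one comma-separated row per instance) from plain tuples. The function name `render_arff` and the tuple-based inputs are assumptions for illustration; the real method works on `Instance` and `Feature` objects and writes to a file.

```python
def render_arff(relation, attributes, rows):
    """Render ARFF text in the layout produced by create_arff.

    attributes: list of (name, arff_type) pairs, e.g. ("is_sunny", "{False, True}")
    rows: list of value sequences, one per instance
    """
    lines = ["@relation {}".format(relation)]
    for name, arff_type in attributes:
        lines.append("\t@attribute {} {}".format(name, arff_type))
    lines.append("")       # blank line before the data section
    lines.append("@data")
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"
```

Writing the result inside a `with open(...)` block would also ensure the file is closed if an exception occurs part-way through.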
# Load a model, if it exists, and set this as the currently trained model for this
# Model instance.
@ -197,68 +230,81 @@ class Model:
# Add a training instance to the model.
def add_train_instance(self, instance):
if isinstance(instance, Instance):
self.training_instances.append(instance)
self.training_instances.append(instance)
else:
raise WekapyException("Argument 'instance' must be of type Instance.")
# Add a testing instance to the model.
def add_test_instance(self, instance):
if isinstance(instance, Instance):
self.testing_instances.append(instance)
self.testing_instances.append(instance)
else:
raise WekapyException("Argument 'instance' must be of type Instance.")
# Train the model with the chosen classifier from features in an ARFF file
def train(self, training_file = None, instances = None, save_as = None, folds = 10):
if self.verbose: print "Training your classifier..."
if save_as == None:
save_as = self.model_dir+"/"+str(self.id)+".model"
if len(self.training_instances) == 0: # if add_train_instance not called:
if training_file == None and instances == None:
raise WekapyException("Please provide some train instances either by naming an ARFF train_set, providing a list of Instances, or calling add_train_instance().")
if training_file == None:
self.create_ARFF(instances, "training")
if instances == None:
def train(self, training_file=None, instances=None, save_as=None, folds=10):
if self.verbose:
print("Training your classifier...")
if save_as is None:
save_as = self.model_dir + "/" + str(self.id) + ".model"
if len(self.training_instances) == 0: # if add_train_instance not called:
if training_file is None and instances is None:
raise WekapyException(
"Please provide some train instances either by naming an ARFF train_set, providing a list of Instances, or calling add_train_instance().")
if training_file is None:
self.create_arff(instances, "training")
if instances is None:
self.training_file = training_file
if len(self.training_instances) > 0: # if add_train_instance called:
if training_file == None and instances == None:
self.create_ARFF(self.training_instances, "training")
# Prioritise adding fetures passed at calltime
if training_file == None and instances is not None:
self.create_ARFF(instances, "training")
if len(self.training_instances) > 0: # if add_train_instance called:
if training_file is None and instances is None:
self.create_arff(self.training_instances, "training")
# Prioritise adding features passed at call time
if training_file is None and instances is not None:
self.create_arff(instances, "training")
# Prioritise ARFF file passed at calltime
if instances == None and training_file is not None:
if instances is None and training_file is not None:
self.training_file = training_file
self.model_file = save_as
process_output, run_time = run_process(["java", "-Xmx"+str(self.max_memory)+"M", "weka.classifiers."+self.classifier, "-x", str(folds),"-t", self.training_file, "-d", save_as])
options = ["java", "-Xmx{}M".format(str(self.max_memory))]
if self.classpath is not None:
options.extend(["-cp", self.classpath])
options.extend(
["weka.classifiers." + self.classifier, "-x", str(folds), "-t", self.training_file, "-d", save_as])
process_output, self.time_taken = run_process(options)
self.trained = True
if self.verbose: print "Training complete (time taken = %.2fs)." % (run_time)
if self.verbose:
print("Training complete (time taken = {:.2f}s).".format(self.time_taken))
# Generate predictions from the trained model from test features in an ARFF file
def test(self, test_file = None, instances = None, model_file = None):
if self.verbose: print "Generating predictions for your test set..."
start_time = time.time()
if not model_file == None:
def test(self, test_file=None, instances=None, model_file=None):
if self.verbose:
print("Generating predictions for your test set...")
if model_file is not None:
self.load_model(model_file)
if not self.trained:
raise WekapyException("The classifier has not yet been trained. Please call train() first")
if len(self.testing_instances) == 0:
if test_file == None and instances == None:
raise WekapyException("Please provide some test instances either by naming an ARFF test_set, providing a list of Instances, or calling add_test_instance().")
if test_file == None:
self.create_ARFF(instances, "test")
if instances == None:
if test_file is None and instances is None:
raise WekapyException(
"Please provide some test instances either by naming an ARFF test_set, providing a list of Instances, or calling add_test_instance().")
if test_file is None:
self.create_arff(instances, "test")
if instances is None:
self.test_file = test_file
if len(self.testing_instances) > 0:
if test_file == None and instances == None:
self.create_ARFF(self.testing_instances, "test")
if test_file == None and instances is not None:
self.create_ARFF(instances, "test")
if instances == None and test_file is not None:
if test_file is None and instances is None:
self.create_arff(self.testing_instances, "test")
if test_file is None and instances is not None:
self.create_arff(instances, "test")
if instances is None and test_file is not None:
self.test_file = test_file
process_output, run_time = run_process(["java", "-Xmx"+str(self.max_memory)+"M", "weka.classifiers."+self.classifier, "-T", self.test_file, "-l", self.model_file, "-p", "0"])
options = ["java", "-Xmx{}M".format(str(self.max_memory))]
if self.classpath is not None:
options.extend(["-cp", self.classpath])
options.extend(["weka.classifiers." + self.classifier, "-T", self.test_file, "-l", self.model_file, "-p", "0"])
process_output, self.time_taken = run_process(options)
lines = process_output.split("\n")
instance_predictions = []
@ -271,22 +317,22 @@ class Model:
p_cat = int((pred[2].split(":"))[0])
p_val = str((pred[2].split(":"))[1])
error = False
prob = 0.0
if "+" in pred[3]:
error = True
prob = float(pred[4])
else:
prob = float(pred[3])
prediction = Prediction(index, ob_cat, ob_val, p_cat, p_val, error, prob)
instance_predictions.append(prediction)
instance_predictions.append(prediction)
self.predictions = instance_predictions
end_time = time.time()
if self.verbose: print "Testing complete (time taken = %.2fs)." % (end_time-start_time)
if self.verbose:
print("Testing complete (time taken = {:.2f}s).".format(self.time_taken))
return instance_predictions
class WekapyException(Exception):
def __init__(self, message):
self.message = message
def __str__(self):
return self.message