kronosapiens.github.io - Designing to be Subclassed









Designing to be Subclassed

Aug 15, 2014

We've reached an interesting stage of the development of the first ParagonMeasure product. As a mobile health telemonitoring tool with immediate research applications, I've been asked to take the analysis library which I wrote over much of June and July and build it into a web application. This presents a number of exciting challenges, the biggest being the challenge of taking a tool meant to run locally and giving it the ability to run online. Other facets of the challenge (dealing with environments, packaging, databases, caching, and the API) are fascinating, and are dealt with in other posts. This post is going to be about the challenge of adapting classes and models to meet new requirements – in a word, about subclassing.

When architecting the web app, I made the call to preserve the original analysis library as it was, keeping all of the local functionality so that future researchers and developers could install the library to manage data and run analysis locally. This meant that rather than changing the original library to meet the new requirements, adapting the library for use on the web would require me to subclass the original classes, override all of the filesystem I/O methods, and replace them with methods which interact with web APIs and noSQL databases.

As Sandi Metz repeatedly emphasizes in her excellent book, good software engineering is future-proof; good design is designed to be changed. While going through the process of subclassing my own library, I saw first-hand why best practices exist the way that they do. Several innocent decisions made in the initial library proved to be hindrances when it came time to extend functionality; getting to solve those problems from both sides (having full control over both the library and the web app) meant that I got to see the interaction first-hand.
Without further ado, here are some insights:

Declaring all instance variables in __init__

This first point is very much the low-hanging fruit of this post. They say that you'll be shocked by how much you forget about the code you write, weeks or even days after the fact. "What did I mean by that?" is a common refrain. They say that the best code is written for humans, not machines, and as such should prioritize clarity above all. What this means in the context of this discussion is that you should set all your instance variables when you initialize an object. This means two things:

First, initialize any placeholder values, even if it's None or [] or {}. Even if you're not planning on using a variable until later, define it in __init__ so that you'll remember it exists. Here's an antipattern: you may think you won't need variable X until method C, so you go ahead and initialize it for the first time at the beginning of C. But say a user subclasses your object and decides they want to use X in method B. Now they're getting an exception and they need to figure out why. They have to both set variable X at the top of their method B, and override method C to remove the setting of X there (which would clobber their version from method B). Further, if they ever decide that they actually want to run method C before method B, for whatever reason, they'll have to go back and add some sort of conditional in method B so they don't override the value of X from method C. This is all a huge and unnecessary headache. Initialize your variables once, in __init__, so they'll be there when you need them, and you won't have to struggle to remember what order you expected your users to do things in. Requirements change, and emergent behavior emerges. Don't confuse yourself and shackle yourself to a single use-case by obscuring which variables your objects depend on.
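A minimal sketch of the payoff (the class and method names here are illustrative, not from the actual ParagonMeasure library):

```python
class Parser(object):
    def __init__(self):
        # Every instance variable is declared up front, even placeholders,
        # so readers and subclassers can see the full state in one place.
        self.rows = []
        self.current_subject = None

    def load(self, rows):
        self.rows = rows

    def count(self):
        # Safe to call in any order: __init__ guaranteed the attribute exists.
        return len(self.rows)

p = Parser()
print(p.count())   # 0, rather than an AttributeError
p.load([1, 2, 3])
print(p.count())   # 3
```

A subclass adding its own method can now rely on self.rows existing, no matter which methods have or haven't run yet.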
Second, don’t write functions which run in __init__ and set instance variables internally to themselves. You’ll misplace the hell out of everyone, as people will have to remember which variables are set by which, since they can’t tell by looking at __init__. Take a cue from the functional programmers and have those functions return their values and set those return values explicitly in __init__. Consider the pursuit example: } matriculation Device(object): def __init__(self, subject_id): self.subject_id = subject_id self.device_dir = 'data/devices/' self.path = self.device_dir + subject_id + '_device.csv' self.create_device() Things are unquestionably looking fine until we get to that last line – self.create_device() looks like it’s doing some heavy duty lifting, but we can’t tell by looking at __init__. We have to go lanugo to where create_device is specified to see what variables it’s setting: def create_device(self): """Create the device dataframe""" self.get_device_dimensions() try: raw_device = pd.read_csv(self.path, sep='<>', skiprows=[0,1]) except: raw_device = pd.read_csv(self.device_dir+'default_device.csv', sep='<>', skiprows=[0,1]) raw_device = self.add_key_data(raw_device) self.data = raw_device.set_index('key', drop=False) So it seems like it’s setting self.data. It would’ve been much clearer if create_device had just returned raw_device.set_index('key, drop=False), and our __init__ looked increasingly like this: } matriculation Device(object): def __init__(self, subject_id): self.subject_id = subject_id self.device_dir = 'data/devices/' self.path = self.device_dir + subject_id + '_device.csv' self.data = self.create_device() This is much clearer for both some other developer trying to subclass your library, as well as for future you, an important but often ignored part of your life. For those of you looking closely, you may have noticed some bonus weirdness. 
Go back and look at the first line of create_device():

```python
def create_device(self):
    """Create the device dataframe"""
    self.get_device_dimensions()
```

What is this? get_device_dimensions()? What does that do? What does that return? Now I have to go read some more source code? What kind of terrible developer are you? Let's peek:

```python
def get_device_dimensions(self):
    """Get the device dimensions in pixels and millimeters"""
    try:
        f = open(self.path)
    except:
        f = open(self.device_dir + 'default_device.csv')
    self.px_size = f.readline().rstrip().split(' ')[-1]
    self.mm_size = f.readline().rstrip().split(' ')[-1]
```

You're kidding me. You just set two more instance variables and said nothing. There should be a fine for this sort of malfeasance. Let's take a few breaths and look at the refactored new hotness:

```python
class Device(object):
    def __init__(self, participant_id):
        self.participant_id = participant_id
        self.device_dir = 'data/devices/'
        self.path = self.device_dir + participant_id + '_device.csv'
        self.px_size, self.mm_size = self.get_device_dimensions()
        self.keys = self.create_device()
```

Such clarity. Such ease of subclassing. Such justice.

Using Constants

One of the first modules I wrote in the library was session_parser.py, defining the SessionParser class, capable of parsing CSV files and creating pandas DataFrames based on their contents. The parser would go row-by-row through the CSV, pull out some of the fields, convert them to dictionaries, and then do further work on the values in the dictionaries. This meant that SessionParser needed to store some knowledge of the structure of the CSV and the interior dictionaries – specifically, knowledge about the keys. My original session_parser.py looked something like this:

```python
...
# More imports
import re  # Python Regular Expression library

class SessionParser(object):
    """Object which can parse CSVs of typing data."""
    ...
```
```python
    def parse_row(self, row):
        """Parse all typing data in a single row."""
        self.current_subject = row['user:id']
        self.current_device = self.load_device(self.current_subject)
        self.current_submit_time_str = row['context:timestamp']
        current_submit_time = pd.to_datetime(self.current_submit_time_str)
        # Creates a pandas.tslib.Timestamp
        ...
        # More stuff
```

This was fine for a while – all my data was coming from the same source, so I naively hard-coded all of the keys right into the methods. Things changed once we went online, though. Data was coming from a web service via an API, not from CSVs stored locally. The data was mostly the same, but there were a number of small differences in format and structure... including, of course, in the CSV column names and dictionary keys. I needed to find some way for the library running locally to use one set of keys, with the subclassed parser using another.

The answer, of course, was to extract the hard-coded keys out of the methods and store them as CONSTANTS at the top of the file, with the code itself just referencing the constants. This is a best practice, I think, for two reasons:

The keys are stored in one place. Changing a key means changing the value of the constant once, versus changing it in multiple places all over the code.

Subclassed parsers can redefine the constants without having to change any of the actual method logic.

My new session_parser.py looks more like this:

```python
...
# More imports
import re  # Python Regular Expression library

USER_ID_KEY = 'user:id'
TIMESTAMP_KEY = 'context:timestamp'
SESSION_KEY = 'finemotortest'

class SessionParser(object):
    """Object which can parse CSVs of typing data."""
    ...
    def parse_row(self, row):
        """Parse all typing data in a single row."""
        self.current_participant = row[USER_ID_KEY]
        self.current_device = self.load_device(self.current_participant)
        self.current_submit_time = pd.to_datetime(row[TIMESTAMP_KEY])
        # Creates a pandas.tslib.Timestamp
        ...
        # More stuff
```

Meanwhile, the subclassed version, OhmageParser, looks like this:

```python
...
# More imports
from webapp import db

# Ohmage API query parameters
USER_ID_KEY = 'urn:ohmage:user:id'
SESSION_ID_KEY = 'urn:ohmage:survey_response:id'
TIMESTAMP_KEY = 'urn:ohmage:context:timestamp'
RESPONSE_KEY = 'urn:ohmage:prompt:response'
CONTEXT_KEY = 'urn:ohmage:context:launch_context_long'

# Not part of the API call, but in the response:
SESSION_KEY = 'urn:ohmage:prompt:id:finemotortest'

class OhmageParser(SessionParser):
    """Object which can parse JSONs of typing data from the Ohmage API"""
    ...
    def parse_row(self, row):
        """Parse all typing data in a single row."""
        self.current_session = self.create_session(row)
        self.current_participant = self.current_session.participant.name
        self.current_device = self.current_session.device
        # Assignments to satisfy the conventions of the parent class.
        self.current_submit_time = pd.to_datetime(row[TIMESTAMP_KEY])
        ...

    def create_session(self, row):
        s = Session(row[SESSION_ID_KEY], pd.to_datetime(row[TIMESTAMP_KEY]))
        s.participant = self.create_or_load_participant(row)
        s.device = self.create_or_load_device(row)
        s.save()
        return s

    def create_or_load_participant(self, row):
        p = Participant.query.filter_by(name=row[USER_ID_KEY]).first()
        if not p:
            p = Participant(row[USER_ID_KEY])
            p.save()
        return p
```

You'll notice a number of things here. The first is that I've redefined the keys by changing the constants at the top of the module. This means that methods defined in session_parser.py can run in the subclass OhmageParser without any problem. The second is that I've actually overridden the parse_row() method. I've done this because parse_row() implements some I/O functionality that required some more heavy-duty customization. This is the subject of the next section.
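An aside on the constants pattern above: a method defined in session_parser.py looks up module-level names in that module's globals, so the child module's constants only take effect in methods the child itself defines. One way to make the keys genuinely overridable by subclasses, with no method overrides at all, is to hang them on the class instead; a hypothetical sketch:

```python
# Hypothetical sketch: keys as class attributes rather than module constants.
class SessionParser(object):
    USER_ID_KEY = 'user:id'
    TIMESTAMP_KEY = 'context:timestamp'

    def participant_of(self, row):
        # Attribute lookup goes through the instance's class, so a
        # subclass's keys are picked up by inherited methods automatically.
        return row[self.USER_ID_KEY]

class OhmageParser(SessionParser):
    USER_ID_KEY = 'urn:ohmage:user:id'
    TIMESTAMP_KEY = 'urn:ohmage:context:timestamp'

print(SessionParser().participant_of({'user:id': 'bob'}))            # bob
print(OhmageParser().participant_of({'urn:ohmage:user:id': 'ana'}))  # ana
```

Either way, the logic references a single named key rather than a scattered string literal.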
Interface methods, public methods, internal methods

This last section is more conceptual, and has to do with how you organize the functionality in your classes. When I was subclassing my own library, I wanted to override as few methods as possible. Every override doubles the amount of code that needs to be maintained (since any big changes in the method need to be reflected in both the parent and child definitions), and feels like a personal design failure (at least for me). Further, an override is a signal that the child class has different needs from the parent class – a few of these may be necessary, but too many may suggest that you haven't thought through your inheritance structure enough. While going through this subclassing process, I discovered that there was a certain elegant and emergent clustering of methods into three categories:

Interface Methods

These are the methods most appropriate for subclassing. These are the input/output methods, which take input from external sources and convert it into formats which the class can work with internally, a bit like this: input_method(outside_data) » something the class understands » output_method(internal_data) » something the outside world understands. In my case, moving from local to the web meant that I needed to override the methods which took data from the filesystem with methods which could pull data from a web service's API. Further, I needed to be able to write data to a database, instead of to local files. By organizing this I/O functionality into methods separate from the core public and internal methods, I could override just those methods to get my classes working in their new environments.
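In miniature, the three-way split can look like this (a toy sketch with illustrative names, not the actual ParagonMeasure classes):

```python
class BaseParser(object):
    # Public method: a stable signature that callers depend on.
    def parse_all(self, rows):
        return [self.save(self.parse(r))
                for r in rows if not self.is_already_parsed(r)]

    # Internal method: used by parse_all, rarely called directly by users.
    def is_already_parsed(self, row):
        return False

    # Interface methods: the I/O edge, and the natural override points.
    def parse(self, row):
        return {'value': row}

    def save(self, parsed):
        return parsed  # e.g. pickle to the local filesystem

class WebParser(BaseParser):
    # Only an interface method changes; parse_all is inherited untouched.
    def save(self, parsed):
        parsed['stored_in'] = 'db'  # e.g. insert into a database instead
        return parsed

print(WebParser().parse_all([1, 2]))
# [{'value': 1, 'stored_in': 'db'}, {'value': 2, 'stored_in': 'db'}]
```

The public method never learns where the data went; it only trusts that the interface methods honor their signatures.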
For example, consider my save_dataframe() method from SessionParser (note: I still need to implement functionality for saving two kinds of data – something I've done in the new version but not the original):

```python
def save_dataframe(self, parsed_row):
    """Save (create or update) participant data as appropriate."""
    # NOTE: NOT IMPLEMENTED DUAL TASK/MOTION DATAFRAMES
    dataframe = parsed_row['task_dataframe']
    participant_id = parsed_row['participant_id']
    try:
        prior_dataframe = pd.read_pickle(self.storage_dir + participant_id)
        output_dataframe = pd.concat([prior_dataframe, dataframe])
    except:
        output_dataframe = dataframe
    finally:
        (output_dataframe.sort(columns=['SubmitTime', 'Task', 'TouchTime'])
            .to_pickle(self.storage_dir + participant_id))
        # Saves a pandas DataFrame locally as a 'pickle'
```

Compare with the overridden method in OhmageParser, the child class:

```python
def save_dataframe(self, parsed_row):
    """Write new data to the database."""
    if parsed_row['task_dataframe'] is not None:
        self.save_typing_data(parsed_row['task_dataframe'])
    if parsed_row['motion_dataframe'] is not None:
        self.save_motion_data(parsed_row['motion_dataframe'])

def save_typing_data(self, task_dataframe):
    task_dataframe['session_id'] = self.current_session.mongo_id
    task_dataframe['participant_id'] = self.current_session.participant.mongo_id
    typing_data_list = task_dataframe.to_dict(outtype='records')
    db.session.db.TypingData.insert(typing_data_list)
    # Saves the data as a series of Documents in a MongoDB database.

def save_motion_data(self, motion_dataframe):
    motion_dataframe['session_id'] = self.current_session.mongo_id
    motion_data_list = motion_dataframe.to_dict(outtype='records')
    db.session.db.MotionData.insert(motion_data_list)
```

You'll note how keeping the same method signature (save_dataframe(self, parsed_row)) means that the core methods inherited from the parent class don't need to know about the implementation of this method – the child can override the method to deal with new storage requirements, but as long as it keeps the function signature the same, inherited methods will work just fine.

Internal Methods

These are methods which the class uses internally for various tasks, but which are often not called by the user directly. I found that I would occasionally need to override these to account for the new environment I was in, even though they weren't pure I/O methods. One example was SessionParser's is_already_parsed() method. In the parent class, this method would reference a dictionary of already-seen dataframes to establish whether or not a new row had been parsed. The child class, OhmageParser, since it was pulling data from an API and could constrain the query by dates, could ensure that it was seeing only new data by specifying a recent time period. This method was called by the parse_all() method, which is the principal public method of the parent class (and the one that I definitely didn't want to have to override), which meant that I needed to do something like this:

```python
def is_already_parsed(self, row):
    """Unnecessary b/c Ohmage can be queried w/ date range."""
    return False
```

By overriding (and substantially neutralizing) an internal method, without changing its method signature, I was able to alter the behavior of the parent class's methods without overriding them.

Public Methods

These are the good-looking, outward-facing prom kings and queens of your class. These are the methods that people will come to know and love.
These should be as powerful and generic as possible. Most importantly, their inputs and outputs should rarely, if ever, change. Well, maybe not never. But very rarely. These public methods are what form the API, the interface, between your classes and the rest of the universe. They're how other people will learn to interact with your class. When people start using your library as part of their project, these methods are what they will add to their code. This means that if you screw around with these methods, everyone who is using your library will have to change their code. To avoid that, these methods should be quite general-purpose, with lots of optional arguments and flags so people can tweak behavior while still working within the boundaries that the method defines.

For inputs, this means using lots of optional keyword arguments with default values. If you can come up with a sensible default for any new parameter, you can expand the functionality of your public methods without breaking backward compatibility.

That's inputs. What about outputs? This brings us nicely to the last point:

Dictionaries are the Best Return Value

They're such a good return value it's almost too hard to believe. Think about it this way: if you return a dict, then the receiving function knows to access whatever value it wants by using the correct key. As long as you don't change the name of the dict or the name of the key, you can add literally anything you want to that dictionary without breaking backwards compatibility. Here's a good example. Consider the old, broken version:

```python
def convert_raw_session_to_dataframe(self, raw_session):
    """Convert a JSON of fine motor test data into a DataFrame."""
    ...
    # About a dozen lines of pure brilliance
    if task_dataframes:
        return pd.concat(task_dataframes)
    else:
        return False
```

Ok, so assuming my task_dataframes list isn't empty (which happens – live data can be treacherous), I squeeze it all into a single DataFrame and return that sucker. The result gets received like this:

```python
def parse_row(self, row):
    """Parse all typing data in a single row."""
    ...
    # More brilliance
    session_dataframe = self.convert_session_fmt_to_dataframe(raw_session)
    if session_dataframe is not False:
        session_dataframe['SubmitTime'] = current_submit_time
        session_dataframe = session_dataframe.set_index(['SubmitTime'], drop=False)
    return session_dataframe, self.current_subject
```

Note that parse_row expects convert_raw_session_to_dataframe to return one DataFrame. It then does some further work and returns a tuple of the DataFrame and the current_subject. Going up one more level, let's see how it comes together:

```python
def parse_file(self, check_parsed=True):
    """Parse each row in a CSV file"""
    ...
    fmt_dataframe, subject_id = self.parse_row(row)
    if fmt_dataframe is not False:
        self.save_dataframe(fmt_dataframe, subject_id)
```

Ok, so parse_file expects to get a tuple back from parse_row. Of course, this entire design is stupid. Why? Because the minute I needed to add something – say, for example, a second DataFrame – the entire thing fell apart. I needed to change every call to convert_raw_session_to_dataframe to expect a tuple of DataFrames, and every call to parse_row to give back a tuple of two DataFrames and a user. Ridiculous. Fortunately, a better solution presented itself immediately:

```python
def convert_raw_session_to_dataframes(self, session_dict):
    """Convert a dict of session data into a multi-task DataFrame."""
    ...
    # Nobel-prize-winning algorithmic brilliance
    return {'task_dataframe': task_dataframe,
            'motion_dataframe': motion_dataframe}
```

Gosh, that was easy.
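In isolation, the forgiveness of the dict-based return value can be sketched like this (toy values and illustrative names, not the real parser):

```python
def parse_row(row):
    # Version 1: returns a dict with a single key.
    return {'task_dataframe': row * 2}

def caller(row):
    parsed = parse_row(row)
    return parsed['task_dataframe']  # keyed access, not positional unpacking

print(caller(3))  # 6

def parse_row(row):
    # Version 2 grows a new key; caller above keeps working unchanged,
    # because it only ever asks for the key it cares about.
    return {'task_dataframe': row * 2, 'motion_dataframe': row * 3}

print(caller(3))  # still 6
```

Had caller unpacked a tuple instead, version 2 would have broken it immediately.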
Let’s go up a level: def parse_row(self, row): """Parse all typing data in a single row.""" ... # Pure poetry parsed_row = self.convert_raw_session_to_dataframes(session_dict) parsed_row['participant_id'] = self.current_participant return parsed_row Oh, neat. I want to add something to the return value? Just toss that sucker in the dict. def parse_all(self, check_parsed=True): """Parse each row in a pandas DataFrame."""" ... # I think I've used this joke up parsed_row = self.parse_row(row) if (parsed_row['task_dataframe'] is not None or parsed_row['motion_dataframe'] is not None): self.save_dataframe(parsed_row) Wow. So you’re saying that I can now add anything I want to this parsed_row wordlist to meet virtually any new requirement without having to make any changes to existing code? Neat. Comments Please enable JavaScript to view the comments powered by Disqus. Abacus Abacus kronovet@gmail.com kronosapiens kronosapiens I'm Daniel Kronovet, a data scientist living in Tel Aviv.