- The fundamental operation required by the challenge is to extract aggregate information from a `prescription` table, using its `drug_name` field
- If I were to implement it without the challenge restrictions, using e.g. pandas, this would mean a `groupby` operation on the `drug_name` column, followed by an `aggregate` operation, using `sum` on `drug_cost` and `nunique` on `prescriber_id`
- Another Python solution, similar to the previous one but without pandas, would be to read the CSV file into a SQLite database table and then issue a `GROUP BY` query using the `SUM` and `COUNT DISTINCT` aggregation functions.
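As a minimal sketch of the SQLite route (using an in-memory table with assumed column names in place of the real CSV import):

```python
import sqlite3

# In-memory stand-in for reading the CSV into a SQLite table.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE prescription (id INTEGER, prescriber_id INTEGER,"
    " drug_name TEXT, drug_cost REAL)"
)
con.executemany(
    "INSERT INTO prescription VALUES (?, ?, ?, ?)",
    [
        (1, 10, "AMBIEN", 100.0),
        (2, 11, "AMBIEN", 200.0),
        (3, 10, "BENZTROPINE", 300.0),
    ],
)

# GROUP BY with SUM and COUNT(DISTINCT ...), as described above.
rows = con.execute(
    "SELECT drug_name, COUNT(DISTINCT prescriber_id), SUM(drug_cost)"
    " FROM prescription GROUP BY drug_name"
    " ORDER BY SUM(drug_cost) DESC, drug_name"
).fetchall()
print(rows)   # [('AMBIEN', 2, 300.0), ('BENZTROPINE', 1, 300.0)]
```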
- I used a row-based paradigm, supported by classes (`Prescription` and `DrugSummary`) to represent the fundamental domain data units. The main advantages of this approach:
    - Easy to read, and naturally domain-aligned: natural correspondence input line <-> table record <-> Python object
    - Easy to write: CSV files in Python are read and written line by line, and record instances can naturally be serialized to/from `tuple`s
- NumPy and pandas are primarily column-oriented, for performance among other reasons. Implementing a column-based framework did not seem to be a good idea, both for the purpose of this challenge (since no explicit performance targets were given) and in general (reinventing the wheel, unless of course there were particular requirements to do so)
- Since the problem presents a naturally high degree of coupling between the data (the values in the record fields) and the operations on the data, classes were used instead of a non-OO approach using plain `dict` instances and free functions
- `Prescription`
    - Its instances represent records in the input table
    - I exploited the fact that, once converted, its data attributes are immutable: it is implemented as a `namedtuple` from the `collections` module in the Python standard library (plus additional functionality)
- `DrugSummary`
    - Its instances represent records in the output table
    - A `namedtuple` could not be used in this case, because the data attributes are mutated after instantiation. I used the `__slots__` class attribute to freeze the set of data attributes, both to explicitly enforce constraints on the API and to help reduce the memory usage of each instance (by at least 15%, from a cursory A/B comparison using the Python `memory_profiler` module)
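The `__slots__` idea can be illustrated with a simplified stand-in for the real class (attribute names and the `add` helper are assumptions, not the actual implementation):

```python
from decimal import Decimal


class DrugSummary:
    """Simplified stand-in showing the __slots__ idea: the set of data
    attributes is frozen (typos raise AttributeError instead of silently
    creating new attributes), and no per-instance __dict__ is allocated,
    which reduces memory usage."""

    __slots__ = ("drug_name", "prescribers", "total_cost")

    def __init__(self, drug_name):
        self.drug_name = drug_name
        self.prescribers = set()          # distinct prescribers seen so far
        self.total_cost = Decimal("0")

    def add(self, prescriber_id, cost):
        """Mutate the record in place -- the reason a namedtuple won't do."""
        self.prescribers.add(prescriber_id)
        self.total_cost += cost


summary = DrugSummary("AMBIEN")
summary.add(10, Decimal("100"))
summary.add(10, Decimal("50.5"))
try:
    summary.totl_cost = Decimal("1")      # typo: rejected by __slots__
except AttributeError:
    print("unknown attribute rejected")
```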
- The `table_drugs` data structure is implemented as a `defaultdict` variant with the `DrugSummary` class as its default factory, so that indexing (`[]`/`__getitem__`) fetches or creates the corresponding record (semantically similar to an `UPSERT` operation in database terms). The automatic insertion of `drug_name` in the `DrugSummary` instances keeps the `keys` of `table_drugs` in sync with its `values`, establishing a rough conceptual correspondence: `keys` -> table index, `values` -> table rows/records
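A minimal sketch of such a `defaultdict` variant (class names and bodies simplified; the real implementation differs): since `collections.defaultdict` calls its factory with no arguments, overriding `dict.__missing__` is one way to pass the key to the factory so that `drug_name` is inserted automatically.

```python
class DrugTable(dict):
    """dict variant whose missing-key lookup builds the record from the
    key itself (semantically an UPSERT).  collections.defaultdict cannot
    do this directly, because its default_factory takes no arguments."""

    def __init__(self, record_factory):
        super().__init__()
        self.record_factory = record_factory

    def __missing__(self, key):
        record = self[key] = self.record_factory(key)
        return record


class DrugSummary:
    """Simplified stand-in: the key (drug_name) is mirrored into the record."""

    def __init__(self, drug_name):
        self.drug_name = drug_name
        self.prescribers = set()
        self.total_cost = 0


table_drugs = DrugTable(DrugSummary)
row = table_drugs["AMBIEN"]            # first access creates the record...
row.prescribers.add(10)
row.total_cost += 100
same = table_drugs["AMBIEN"]           # ...later accesses fetch it
```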
- I implemented basic functionality to ensure proper treatment of invalid data
- In the absence of an example of a real-world "dirty" dataset, only basic assumptions can be made about how defensive the conditions should be. Considering only the specific task required by the challenge, the code seems sufficiently robust to inexact data: e.g. if there are transactions where the drug name is not given, these will automatically be grouped together; similarly with prescribers
- On the other hand, the aggregated results will not be correct if the same drug is recorded under different names. In database terms, this could be addressed by using normalized tables (e.g. a `drugs` table and a `prescribers` table) and encoding the relation between them using unique identifiers (primary keys), e.g. storing `drug_id` in the `prescriptions` dataset instead of `drug_name`
- Without explicit requirements for invalid data, my approach is to notify the location and the nature of the error, and skip the corresponding record. In the current version, the notification is printed to stdout. This can be modified by using a different `LOG_SKIPPED` function
- The "conversion" step (converting the raw types given by the input format reader to domain-appropriate ones) and the "validation" step (deciding if a record should be processed in the analysis) are conceptually separate (although there is some amount of overlap), so I keep them separate in the code as well. The criteria for validation belong logically both to the input record and to the analysis being performed, so they are implemented as an independent function rather than as a method of the `Prescription` class
- A failed deserialization or conversion is propagated to the client as a `ValueError` exception
- Among the possible erroneous data types considered:
    - Wrong number of values, potentially caused by missing separators in the input file
    - Non-numeric input values where numeric values are expected
    - Null values (as exemplified in the `is_valid_stricter()` validator)
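The conversion/validation split might look like the following sketch (the field layout and function bodies are assumptions; only `is_valid_stricter` and the `ValueError` behaviour are named in the text):

```python
from decimal import Decimal, InvalidOperation

# Assumed field layout:
# id, prescriber_last_name, prescriber_first_name, drug_name, drug_cost
N_FIELDS = 5


def convert(raw_fields):
    """Conversion step: raw CSV strings -> domain types.
    Any failure is propagated to the client as ValueError."""
    if len(raw_fields) != N_FIELDS:                  # e.g. missing separator
        raise ValueError(f"expected {N_FIELDS} fields, got {len(raw_fields)}")
    rec_id, last, first, drug, cost = raw_fields
    try:
        return int(rec_id), last, first, drug, Decimal(cost)
    except (ValueError, InvalidOperation):           # non-numeric value
        raise ValueError(f"non-numeric field in record {raw_fields}") from None


def is_valid_stricter(record):
    """Validation step, kept separate from conversion: decide whether a
    successfully converted record should enter the analysis (here: no nulls)."""
    return all(field not in ("", None) for field in record)


record = convert(["1000000001", "Smith", "James", "AMBIEN", "100"])
assert is_valid_stricter(record)
try:
    convert(["1000000001", "Smith", "AMBIEN", "100"])   # only 4 fields
except ValueError as err:
    print(f"skipped record: {err}")
```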
- I decided to separate the different parts of the project as a Python package `pharmacy_counting` comprising a few modules:
    - `util` contains all common utilities
    - `config` contains common settings, which can be accessed and imported by other modules
    - `data` contains the model classes, as well as the data-specific functions
    - `run` contains the actual operations in the analysis
- NOTA BENE: this package structure requires an additional step when invoking the `run` module, i.e. adding the `src/` directory to the Python path (this is done in the `run.sh` script, by adding that directory to the `PYTHONPATH` environment variable)
    - This is not needed when installing the package using e.g. `setup.py`
- When representing currency, using floating-point numeric types (`float` in Python) is discouraged. Having never dealt with money in a production environment, my first choice would be to research best practices and/or libraries for dealing with currency, at the level required by the application. Under the challenge restrictions, I chose to use the `Decimal` class from the `decimal` module in the Python standard library for fixed-point computation
- A lower-level alternative would be to represent amounts in the smallest currency unit (in the case of USD, 0.01 USD) as `int`s, and then multiply or divide (in the case of USD, by 100) as appropriate. For this particular application that would be fine (since only sums are involved), but unexpected results may occur when dealing with multiplication or division. Again, the optimal implementation depends on the precision needed by the application
- There can be a potential inconsistency when the result of a cost computation is a round number. In the default test provided, the costs are expressed in whole dollars (i.e. 200 USD shows up as `200`), but when computing a larger amount of prescriptions, the probability of a whole-dollar result is low (naively, 1/100 = 1%). This can create ambiguity and/or inconsistency, e.g. when displaying `200` instead of `200.00`. If the application requires consistency in this case, a variant of `serialize_currency` can be used to explicitly write the amount in cents even for whole-dollar amounts
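A short sketch of both points: the float pitfall that motivates `Decimal`, and a possible shape for the serializer variant (`serialize_currency` is named in the text, but its body and the `_consistent` variant here are assumptions):

```python
from decimal import Decimal

# float arithmetic accumulates binary rounding error...
print(sum([0.10] * 3))                    # 0.30000000000000004
# ...while Decimal is exact for decimal inputs (fixed-point computation)
print(sum([Decimal("0.10")] * 3))         # 0.30


def serialize_currency(amount):
    """Assumed current behaviour: whole-dollar amounts drop the decimals
    (200, not 200.00), as in the default test."""
    cents = amount.quantize(Decimal("0.01"))
    whole = cents == cents.to_integral_value()
    return str(cents.to_integral_value()) if whole else str(cents)


def serialize_currency_consistent(amount):
    """Variant: always write two decimals, removing the 200 vs 200.00
    ambiguity discussed above."""
    return str(amount.quantize(Decimal("0.01")))


print(serialize_currency(Decimal("200")))             # 200
print(serialize_currency_consistent(Decimal("200")))  # 200.00
```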
- I successfully ran the analysis on the large dataset provided but, because of its size, did not include the dataset in the submitted directories
- Without careful profiling and optimization, the performance seems acceptable: the runtime (single-threaded) is about 2--3 minutes, with a memory consumption of about 2--3 GB
- There are several names (in both `prescriber_first_name` and `prescriber_last_name`) that contain single-quote (`'`) characters. This might require extra attention/quoting. Most CSV dialects or variants use double quotes for quoting, so they should not pose a problem for the purpose of the challenge. If the original dataset is not
- The `id` field of the input file (the `prescription` table) is not unique to each line. It seems to be an integer compatible with a timestamp (UNIX epoch), thus readable as a `datetime` object; however, I refused the temptation to guess and did not include this information in the code (it was in any case not needed)
In addition to the ones in the `insight_testsuite` directory, additional unit tests are available in the `tests` subdirectory. The tests require `pytest` to run (`pip install pytest; pytest -v name_of_test_module.py`).
In other words, what would I do differently?
- Of course, the most important input for deciding on any further extension or modification of the code is the set of requirements of any further analyses
- An extension of the class-based approach used here could lead to an ORM-like implementation, where:
    - all the domain logic is concentrated in `Model` subclasses
    - records are represented as model-class instances, stored and managed transparently from the client, and accessed in the client through e.g. a `Model.get_or_create(primary_key)` class method

  The advantage is that the API would be a bit cleaner, and would allow e.g. optimizing or rearranging the "backend" (the layout of the database-table equivalent; at the moment, a subclass of a Python `dict`) independently of the analysis code. However, this requires careful consideration, both to research the best architecture to implement it, and most importantly to decide whether the added complexity would actually be worth the flexibility: for this reason, I did not progress further with this idea.
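The ORM-like idea could be sketched roughly as follows (a hypothetical design, not something implemented in the project; only `Model.get_or_create(primary_key)` is named above):

```python
class Model:
    """Minimal sketch of the ORM-like idea.  Each subclass gets its own
    in-memory "table"; the backend (here a plain dict keyed by primary
    key) could be rearranged without touching analysis code, since
    clients only use the class-method API."""

    _table = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._table = {}                      # one backend per model class

    @classmethod
    def get_or_create(cls, primary_key):
        record = cls._table.get(primary_key)
        if record is None:
            record = cls._table[primary_key] = cls(primary_key)
        return record


class DrugSummary(Model):
    def __init__(self, drug_name):
        self.drug_name = drug_name
        self.total_cost = 0


row = DrugSummary.get_or_create("AMBIEN")    # created on first call
row.total_cost += 100
again = DrugSummary.get_or_create("AMBIEN")  # fetched on later calls
```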
- One could argue that the separation between the `data` and `run` modules (or, more conceptually, areas) is somewhat porous, and should be researched more carefully
    - If the workflow of potential future analyses is similar, the I/O operations could be factored out, and possibly moved to a `Table`-like interface in the `Model` classes
- To improve performance, the necessary first step is to profile the code carefully, and only then tackle clear performance hotspots
- In general, were the profiling results to show that the operation is CPU-bound, one approach could be parallel processing with the Python `multiprocessing` module, to exploit all available processors on a multicore machine
    - This is an "embarrassingly parallel" algorithm, since there are no dependencies in the input data, but the necessity of merging the partial aggregations could introduce a potentially significant overhead
    - From the point of view of the code, this would require splitting the input lines between the desired worker units, each producing a `table_drugs` instance, and then implementing an additional function to merge them into a single table
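The split/aggregate/merge workflow could be sketched as follows (plain `dict`/`list` stand-ins for `table_drugs`/`DrugSummary`; the distinct-prescriber sets are what makes the merge step non-trivial):

```python
from multiprocessing import Pool


def aggregate(rows):
    """Worker: build a partial table {drug_name: [prescriber_set, total_cost]}."""
    table = {}
    for prescriber_id, drug_name, cost in rows:
        entry = table.setdefault(drug_name, [set(), 0])
        entry[0].add(prescriber_id)
        entry[1] += cost
    return table


def merge(partials):
    """Merge step: union the prescriber sets, sum the costs."""
    merged = {}
    for table in partials:
        for drug_name, (prescribers, cost) in table.items():
            entry = merged.setdefault(drug_name, [set(), 0])
            entry[0] |= prescribers
            entry[1] += cost
    return merged


if __name__ == "__main__":
    rows = [(10, "AMBIEN", 100), (11, "AMBIEN", 200), (10, "BENZTROPINE", 300)]
    chunks = [rows[:2], rows[2:]]              # split input among workers
    with Pool(processes=2) as pool:
        partials = pool.map(aggregate, chunks)
    print(merge(partials))
```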
```
> /usr/bin/time -v ./run.sh
Command being timed: "./run.sh"
User time (seconds): 153.61
System time (seconds): 1.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:35.38
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2468256
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 13
Minor (reclaiming a frame) page faults: 709591
Voluntary context switches: 147
Involuntary context switches: 656
Swaps: 0
File system inputs: 2268280
File system outputs: 152
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```