Friday, November 29, 2013

Debugging Bayesian filters

It is a common problem: a trained filter miscategorizes something and you'd like to know why - which feature really puts the most weight into the wrong category? Or sometimes you'd like to know which features put another object into the right category - maybe we should add them somehow to the object? In AI::NaiveBayes::Classification there is a method called find_predictors that finds the features that weigh most for and against the category that a given object is eventually classified under. The simple algorithm assumes that there are only two categories - but it should be possible to extend it to more categories. The returned numbers are hard to interpret - what is important is how big they are in comparison with the other numbers in the result.
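
To make the idea concrete, here is a minimal self-contained sketch of how such per-feature weights could be computed for two categories. It is not the code of find_predictors and does not use AI::NaiveBayes at all: the %log_prob table, the predictors subroutine and all the numbers are made up for illustration. Each feature is scored by the difference of its log-probabilities in the two categories, multiplied by its count in the document.

```perl
use strict;
use warnings;

# Hypothetical model: log-probability of each feature per category,
# as a naive Bayes trainer might produce. All numbers are made up.
my %log_prob = (
    spam => { viagra => -1.0, free => -1.5, meeting => -4.0 },
    ham  => { viagra => -5.0, free => -2.5, meeting => -1.2 },
);

# Score every feature of a document (feature => count) by how much it
# favours $winner over $loser. Positive scores push towards $winner,
# negative scores push towards $loser.
sub predictors {
    my ($doc, $winner, $loser) = @_;
    my %score;
    for my $feature (keys %$doc) {
        # features unknown to the model are skipped in this sketch
        next unless exists $log_prob{$winner}{$feature}
                and exists $log_prob{$loser}{$feature};
        $score{$feature} = $doc->{$feature}
            * ($log_prob{$winner}{$feature} - $log_prob{$loser}{$feature});
    }
    # strongest influence first, compared by absolute value
    return [ map  { [ $_, $score{$_} ] }
             sort { abs($score{$b}) <=> abs($score{$a}) || $a cmp $b }
             keys %score ];
}

my $result = predictors({ viagra => 1, free => 2, meeting => 1 }, 'spam', 'ham');
printf "%-8s %+.2f\n", @$_ for @$result;
```

As in the real method, the absolute values mean little on their own - the point is the ordering and the relative size of the scores.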

We use the classifier for spam detection - whenever we get misclassified posts I check what words (or other features) push them into the wrong category and I decide what to do - should we improve the training examples, add more post features, or maybe we can just ignore the case. When improving the filters by adding or removing examples, I can check how that changes the classification and also how exactly it changes the influence of each important feature on the result.

Friday, January 11, 2013

Immutable objects and Dependency Injection

At some point we had:
in our code.

Text::FeatureCount was at that time an accumulator that counted the document features and saved them in internal structures. Multiple subroutines used these structures - so it made sense to make them object attributes. But it also meant that changing Text::FeatureCount into something else was a big deal. We could make the class name variable: $feature_counter_class->new->analyze(... and add an attribute to the main object to store it. But I decided to make Text::FeatureCount an immutable object instead and inject a ready-made Text::FeatureCount into the object doing the work above. Now it can be used in the loop without re-constructing it to clean the internal structures.

When coding with immutable objects you have to pass the input data around from one method to another explicitly instead of keeping it readily available in the object attributes. This makes it slightly un-object-oriented, but the benefit is that you don't need to re-construct the object, so you can use Dependency Injection on it and make the code more flexible. It is also easier to reason about the algorithm when some parts are immutable. Often this is a good trade-off.
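
A rough sketch of that arrangement, with invented names: My::FeatureCounter and My::Trainer are hypothetical stand-ins, not the real Text::FeatureCount or the code that used it. The counter holds only its configuration, analyze returns a fresh result instead of storing it, and the counter is handed to the object that needs it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the feature counter: it holds only
# configuration and never stores per-document state.
package My::FeatureCounter;

sub new {
    my ($class, %args) = @_;
    return bless { min_length => $args{min_length} // 1 }, $class;
}

# Input goes in, the result comes out; $self is not modified.
sub analyze {
    my ($self, $text) = @_;
    my %count;
    $count{ lc $_ }++
        for grep { length($_) >= $self->{min_length} } $text =~ /(\w+)/g;
    return \%count;
}

package My::Trainer;

# The counter is injected, so any object with an analyze method will do.
sub new {
    my ($class, %args) = @_;
    return bless { counter => $args{counter} }, $class;
}

sub count_all {
    my ($self, @documents) = @_;
    # The same counter object is reused for every document.
    return [ map { $self->{counter}->analyze($_) } @documents ];
}

package main;

my $counter = My::FeatureCounter->new(min_length => 3);
my $trainer = My::Trainer->new(counter => $counter);
my $counts  = $trainer->count_all('Buy now, buy cheap', 'See you at the meeting');
```

Swapping in a different counter is then only a matter of passing another object to the constructor, with no class name stored anywhere.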

PS. After much other refactoring Text::FeatureCount mutated into Text::WordCounter (and AI::Classifier::Text::Analyzer) - soon to be released to CPAN.

Wednesday, October 31, 2012

Dependency Injection and open-sourcing generic parts of apps

Using DI in CPAN libs makes them more universal - but DI is even more important when you want to open-source some generic part of your application. Your boss agrees and then you encounter code like this:

Thursday, August 30, 2012

Interactive presentations

At my latest YAPC talk I used questions to the audience to make sure that everything is understood. That does not mean that I asked 'do you understand it' - but rather I asked 'how would you estimate this or that' - and then I extracted from the answers the generic strategies that I had prepared to talk about. This worked spectacularly well and I think I'll add it to all my presentations.

Thursday, June 07, 2012

Why do web frameworks tend to grow into such unwieldy beasts?

Most of the web frameworks I know tend to do at least two things - the web stuff plus the creation and initialization of all other components. This means that the framework is coupled with all of these components, and this is the root of all evil in web frameworks. Of course you need a place to do that object creation and wiring work - but it is not really related to web stuff - it should have its own place in the program, ideally in a Dependency Injection compartment. That does not have to be a container built on one of the available libraries - it can also be coded by hand (I might change my mind some day, but for now I don't see any reason to use DI container libraries in a dynamic language like Perl). All the arguments for using Dependency Injection apply here too, but even for someone rejecting DI it should be pretty obvious that reading config files and initializing objects is not much related to web stuff, and if you buy the single responsibility principle you should split them into separate libraries.
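
As a small illustration of what a hand-coded wiring compartment might look like, here is a sketch with invented names: My::Greeter, My::WebApp and My::Wiring are hypothetical and not taken from any framework. The web layer only turns a request into a response; building the objects from the config happens in one separate place.

```perl
use strict;
use warnings;

# Hypothetical application component; it knows nothing about the web.
package My::Greeter;

sub new {
    my ($class, %args) = @_;
    return bless { greeting => $args{greeting} }, $class;
}

sub greet {
    my ($self, $name) = @_;
    return "$self->{greeting}, $name!";
}

# Hypothetical web layer: it only turns a request into a response,
# using the collaborators that were handed to it.
package My::WebApp;

sub new {
    my ($class, %args) = @_;
    return bless { greeter => $args{greeter} }, $class;
}

sub handle {
    my ($self, $request) = @_;
    return [ 200, [ 'Content-Type' => 'text/plain' ],
             [ $self->{greeter}->greet($request->{name}) ] ];
}

# Hand-coded wiring: the one place that reads the config and builds objects.
package My::Wiring;

sub build_app {
    my ($config) = @_;
    my $greeter = My::Greeter->new(greeting => $config->{greeting});
    return My::WebApp->new(greeter => $greeter);
}

package main;

my $app      = My::Wiring::build_app({ greeting => 'Hello' });
my $response = $app->handle({ name => 'world' });
print $response->[2][0], "\n";
```

Neither My::WebApp nor My::Greeter knows how the other one was created, so either can be replaced or tested without touching the rest.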

I don't know Django or Rails too deeply - but I've observed this with Catalyst. Now, Catalyst is supposed to be decoupled from the Views, Models and other Plugins; there are many different ones in each category and you can replace them freely in your programs. But the Catalyst code-base is still pretty big, and then you have all these Catalyst::Models, Catalyst::Views and Catalyst::Plugins that don't really do any meaningful work - they only map one interface into another one. It could be much simpler if Catalyst only cared about the web related processing.

Saturday, May 12, 2012

Non-compatible changes in WebNano

In my latest commit in WebNano I refactored a lot of code and changed the API in a non-compatible way. I am going to make a new release with those changes soon. Making an additional release beforehand, only to warn about this, sounds kind of silly to me - isn't announcing it here enough?