GET /posts/
HTTP 200 OK
Allow: GET, POST, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept

[
    {
        "user": 1,
        "title": "MongoDB on Ambari",
        "content": "Mongo does NOT come with Ambari.  And yes, it is  a pain in the ass to install.  Just trust me though, it DOES work.  It might make you wish you'd gone into the Trades instead of IT but it will work.\r\n\r\nI'm using Oracle Virtual Box with HDP VirtualBox 2.65.  This is an old version however I've found it to be a hella less of a giant memory pig than any newer ones.  Feel free to experiment.  Especially if you have a better computer than the pile of crap I'm using.\r\n\r\nSo now just do what I say and don't ask any questions my darlings...\r\n\r\nLog into Ambari with your SSH client.  Make sure to be in root because you'll be installing stuff:\r\n\r\nsu root\r\nThen cd into the following directory...\r\n\r\ncd /var/lib/ambari-server/resources/stacks/HDP/2.65 (or your version of choice)/services\r\nNow you can grab the MongoDB adapter that this guy so kindly built for us...\r\n\r\ngit clone https://github.com/nikunjness/mongo-ambari.git\r\nNow restart your Ambari service:\r\n\r\nsudo service ambari restart\r\n... in fact you should tattoo the above command on the back of your wrist, because if anything is EVER going wrong with your Ambari service first just try this.  It's the fancy equivilent of unplugging your computer then plugging it back in again.",
        "publish": "2020-01-21",
        "slug": "mongodb-ambari-2020-01-21",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "You Have to Get Ambari Installed Locally or Just Kill Yourself",
        "content": "When I started in Big Data I figured that using AWS was a great idea.  Hell, it was free.\r\n\r\nAnd then you go to bed only to remember in the morning that you left 6 clusters running because you're an idiot.  Turns out that those mistakes will suck the juice out of your credit card faster than an  Omega Compact CNC80.\r\n\r\nLike days past (PHP, MySql, ect... ) you need to do this stuff on your crappy laptop before you can step into the big leagues.\r\n\r\nLucky for us all it's possible to do now, and only a little bit of a pain in the ass.\r\n\r\nIf you're on a Mac then you're on your own.  I can't afford a precious little mac so get a $300 Dell and do the following ....\r\n\r\nFirst you need a little Oracle tech.  It's their VirtualBox that you can grab for free, the only trick is that it's huge:\r\n\r\nhttps://www.virtualbox.org/\r\n\r\nClick the big green button and go away on vacation then once you're back install whatever it is that you got.\r\n\r\nOn it's own, this VirtualBox is virtually useless until you add the HortonWorks Sandbox (HDP).  This is also free, however it's a monster.  So grab it here:\r\n\r\nDownloads\r\n\r\n\r\n\r\n... then go on vacation for a few weeks while it downloads.  Once it's done you'll need to run the VirtualBox, then select File/Import Appliance and pick the sandbox you downloaded.\r\n\r\nOne huge caveat; there are many versions of HortonWorks.  I would suggest getting a handful of different versions from the Archives.  My eventual goto version of choice was 2.65.  It seems a tad lighter than the newer ones, but feel free to experiment.  I've got at least 4 versions loaded to go, and you may have better luck with the 3.* series than I did.\r\n\r\nOnce it's loaded up you just need to Start your SandBox and follow the instructions to get a vidual Hadoop UI.\r\n\r\nYou will definitely also require a command line into it, so make sure to get Putty or some other SSH client.\r\n\r\nThe UI opening screen will tell you what address to SSH into, take note of this and login to it.  It's probably something like...\r\n\r\nHost: maria_dev@127.0.0.1\r\n\r\nPort: 2222\r\n\r\nThe default user is maria_dev (same password).  But at some point you'll need to login as an admin so just get it over with now:\r\n\r\nsu root\r\n\r\nThe default admin login is 'admin' and 'hadoop'.  Change this in your SSH session first:\r\n\r\nambari-admin-password-reset\r\n\r\nYou have to restart everyting all the time.  It is time consuming but not difficult.\r\n\r\nambari-agent restart\r\n\r\nFor the record, the above command is the most useful thing in the whole repetoir.  Whenever ANYTHING goes wrong with Ambari pop this into your console and see if it fixes it.",
        "publish": "2020-01-08",
        "slug": "you-have-get-ambari-installed-locally-or-just-kill",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Getting F*ing Drill on Ambari",
        "content": "Drill is a fun query layer that is VERY easy to use but also NOT the easiest thing to set up in an Ambari environment.\r\n\r\nIn fact it took me quite a few tries and some very small changes to get to the end game.  The following worked in my specific environment, Ambari 2.65.\r\n\r\nI hope it works for you, so give 'er a go...\r\n\r\nLogin to your Ambari SSH client and sign in as root:\r\n\r\nsu root\r\nFirst you (obviously) need to get Apache Drill...\r\n\r\nwget http://archive.apache.org/dist/drill/drill-1.12.0/apache-drill-1.12.0.tar.gz\r\n... this is NOT the latest version, but its the only version I got working, feel free to try out others, you can see the full list here:\r\n\r\nhttp://archive.apache.org/dist/drill/\r\nYou know how these things go; nothing ever works with anything else.  You may have to experiment...\r\n\r\nOnce you've got this downloaded simply unpack it ...\r\n\r\ntar -xvf apache-drill-1.12.0.tar.gz\r\nThen cd into the created directory and run the installer ...\r\n\r\nbin/drillbit.sh start -Ddrill.exec.http.port=8086\r\n... however this is the rub; this port number (8086) is the one that worked for me.  This is not the one originally suggested to me.  I would suggest trying it first but if it doesn't work don't disregard the chance that the port number might be your only problem.  Google it.\r\n\r\nIf all of this actually works then you can open a browser, at your Ambari local http site, with that new port number, for example:\r\n\r\nhttp://127.0.0.1:8086\r\nThis is now the pretty cool part because you'll get a fairly decent looking UI for experimenting with Drill.\r\n\r\nDrill is fast and simple; SQL queries that you can use on any database and even ACROSS different types of databases.  For example you can actually perform JOINS between your Drill data and HBase, or any other datasource you may have.\r\n\r\nLemme know if you get this working with different parameter settings/methods/etc... I'd love to add them to this post.",
        "publish": "2020-01-02",
        "slug": "getting-fing-drill-ambari-2020-01-02",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "HBase and Pig and Titanic",
        "content": "Since NoSQL is the future of humanity and will save the Universe, I've thrown together this quick tutorial on how to use it in a (semi) practical sense.\r\n\r\nI’ve used Ambari, locally, to run this experiment.  Although I can’t give a full tutorial on Ambari or Hortonworks, I will provide the following links.  You’ll need to download two files (one giant), and there’s plenty of great documentation for installing and using them:\r\n\r\nHortonworks Data Flatform\r\n\r\nhttps://hortonworks.com/downloads/\r\n\r\n \r\n\r\nOracle VirtualBox (the latest version is 6.0, however I had some problems with this and have reverted down to 4.5)\r\n\r\nhttps://hortonworks.com/downloads/\r\n\r\n \r\n\r\nFor the sake of simplicity I’m using the Titanic data set (the train.csv file) which you can get from Kaggle:\r\n\r\nhttps://www.kaggle.com/c/titanic\r\n\r\n \r\n\r\nThe first thing you’ll want to do is upload this dataset into your HDFS files in Ambari (go to HDFS and files view).  I’ve put mine in a ‘titanic’ directory. You can do this with the command line too, I found it easier just to do the dashboard thing for such a relatively small file.\r\n\r\nYou’ll need to SSH into your local Ambari, being on Windows I’m using Putty.\r\n\r\nOnce you have a nice connection, you can start checking out your HBase situation.  To get to the shell just type:\r\n\r\nhbase shell\r\nAt the HBase prompt you can try a couple of things.  First just type ...\r\n\r\nlist\r\n… to see a list of current tables.  Ambari automatically installs a few examples for you, but we’ll need to make a new one for our Titanic data.  So just type …\r\n\r\ncreate ‘titanic’, ‘passengers’\r\n… which creates a new table called ‘titanic’ with a column family called ‘passengers’.  If you’re not sure what a column family is then you might want to do a bit of research on hBase and NoSQL in general.  It’s not very difficult, but some background will help when you take a look at the final product.\r\n\r\nNow for fun, type…\r\n\r\nscan ‘titanic’\r\n… which should show you a new table with zero rows.\r\n\r\nNow type …\r\n\r\nexit\r\n… to exit from the HBase shell and get back into your normal Linux prompt.  You’re going to need to get a Pig script into this location. The Pig file is as follows.\r\n\r\nA = LOAD '/user/maria_dev/titanic/train.csv'\r\nUSING PigStorage(',')\r\nAS (PassengerId:int, Survived:boolean, Pclass:int, Name:chararray, Sex:chararray, Age:int, SibSp:int, Parch:int, Ticket:chararray, Fare:float, Cabin:chararray, Embarked:chararray);\r\nusers = FILTER A by $3 != 'Name';\r\nDUMP A;\r\nDESCRIBE users;\r\nDUMP users;\r\nSTORE users INTO 'hbase://titanic'\r\nUSING org.apache.pig.backend.hadoop.hbase.HBaseStorage (\r\n'passengers:Survived, passengers:Pclass, passengers:Name, passengers:Sex, passengers:Age, passengers:SibSp, passengers:Parch, passengers:Ticket, passengers:Fare, passengers:Cabin, passengers:Embarked');\r\nFor convenience sake I’ve uploaded it to my server so you can get the file into your Ambari by typing the following (you lazy bugger) …\r\n\r\nwget http://www.matthewhughes.ca/titanic.pig\r\nNow you SHOULD be ready to go!  Simply type …\r\n\r\npig titanic.pig\r\n… and watch the magic happen.  It can take a while, so go get a coffee.\r\n\r\nOnce it’s done (successfully we hope), go back into the hBase shell, and scan your titanic table (as per instructions above).  You’re titanic data is now in hBase!  (or HBase, or hbase, or HBaSe, who knows...)",
        "publish": "2019-12-30",
        "slug": "hbase-and-pig-and-titanic-2019-12-30",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Some Useful (and Simple) PySpark Functions",
        "content": "I’ve been to Spark and back.  But I did leave some of my soul.\r\n\r\nAccording to Apache, Spark was developed to “write applications quickly in Java, Scala, Python, R, and SQL”\r\n\r\nAnd I’m sure it’s true.  Or at least I’m sure their intentions were noble.\r\n\r\nI’m not talking about Scala yet, or Java, those are whole other language.  I’m talking about Spark with python. Or PySpark, as the Olgivy inspired geniuses at Apache marketing call it.\r\n\r\nThe learning curve is not easy my pretties, but luckily for you, I’ve managed to sort out some of the basic ecosystem and how it all operates.  Brevity is my goal.\r\n\r\nThis doesn’t include MLib, or GraphX, or streaming, just the basics\r\n\r\nShow pairwise frequency of categorical data\r\n\r\ntrain.crosstab('matchType', 'headshotKills').show()\r\nThis exports something like this:\r\n\r\n+-----------------------+----+----+---+---+---+---+---+---+\r\n|matchType_headshotKills| 0| 1| 2| 3| 4| 5| 6| 8|\r\n+-----------------------+----+----+---+---+---+---+---+---+\r\n| duo-fpp|3762| 608|127| 31| 7| 6| 0| 1|\r\n| solo-fpp|1955| 331| 77| 28| 6| 2| 2| 0|\r\n| normal-duo-fpp| 19| 4| 1| 0| 0| 0| 0| 0|\r\n| crashtpp| 1| 0| 0| 0| 0| 0| 0| 0|\r\n| squad-fpp|6547|1032|216| 56| 14| 4| 1| 0|\r\n| crashfpp| 35| 1| 0| 0| 0| 0| 0| 0|\r\n| normal-squad-fpp| 50| 9| 1| 1| 4| 1| 2| 0|\r\n| normal-solo-fpp| 4| 1| 3| 0| 1| 0| 0| 0|\r\n| squad|2397| 345| 70| 24| 5| 2| 1| 0|\r\n| flarefpp| 5| 0| 0| 0| 0| 0| 0| 0|\r\n| solo| 644| 98| 13| 4| 4| 0| 0| 0|\r\n| normal-duo| 0| 0| 0| 0| 1| 0| 0| 0|\r\n| duo|1198| 159| 55| 9| 3| 2| 0| 0|\r\n| flaretpp| 6| 1| 1| 0| 0| 0| 0| 0|\r\n| normal-squad| 1| 1| 0| 0| 0| 0| 0| 0|\r\n+-----------------------+----+----+---+---+---+---+---+---+\r\nReturns a dataframe with all duplicate rows removed\r\n\r\ntrain.select('matchType','headshotKills').dropDuplicates().show()\r\nDrop any NA rows\r\n\r\ntrain.dropna().count()\r\nFill NAs with a constant value\r\n\r\ntrain.fillna(-1)\r\nA very simple filter\r\n\r\ntrain2 = train.filter(train.headshotKills > 1)\r\nGet the Mean of a Category\r\n\r\ntrain.groupby('matchType').agg({'kills': 'mean'}).show()\r\nGet a count of distinct categories in a Column\r\n\r\ntrain.groupby('matchType').count().show()\r\nGet a 20% sample of a dataframe\r\n\r\nt1 = train.sample(False, 0.2, 42)\r\nCreate a tuple set from Columns.  Note that dataframes do NOT support mapping functionality, so you have to explicitly convert it to an RDD first (it's in the .rdd call below)\r\n\r\ntrain.select('matchType').rdd.map(lambda x:(x,1)).take(5)\r\nOrder by a Column\r\n\r\ntrain.orderBy(train.matchType.desc()).show(5)\r\nAdd a new Column based on the calculation of another Column\r\n\r\ntrain.withColumn('boosts_new', train.boosts /2.0).select('boosts','boosts_new').show(50)\r\nDrop a Column\r\n\r\ntrain.drop('boosts').columns\r\nUsing SQL\r\n\r\ntrain.registerAsTable('train_table')\r\n# sqlContext.sql('select Product_ID from train_table').show(5)\r\nThat's it for now!",
        "publish": "2019-12-13",
        "slug": "some-useful-and-simple-pyspark-functions-2019-12-1",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "How to Start a New PySpark Job",
        "content": "I’ve been to Spark and back.  But I did leave some of my soul.\r\n\r\nAccording to Apache, Spark was developed to “write applications quickly in Java, Scala, Python, R, and SQL”\r\n\r\nAnd I’m sure it’s true.  Or at least I’m sure their intentions were noble.\r\n\r\nI’m not talking about Scala yet, or Java, those are whole other language.  I’m talking about Spark with python. Or PySpark, as the Olgivy inspired geniuses at Apache marketing call it.\r\n\r\nThe learning curve is not easy my pretties, but luckily for you, I’ve managed to sort out some of the basic ecosystem and how it all operates.  Brevity is my goal.\r\n\r\nThis doesn’t include MLib, or GraphX, or streaming, just the basics\r\n\r\nImport some data\r\n\r\ntrain = sqlContext.read.option(\"header\", \"true\")\\\r\n.option(\"inferSchema\", \"true\")\\\r\n.format(\"csv\")\\\r\n.load(\"train_V2.csv\")\\\r\n.limit(20000)\r\nShow the head of a dataframe\r\n\r\ntrain.head(5)\r\nList the columns and their value types\r\n\r\ntrain.printSchema()\r\nShow a number of rows in a better format\r\n\r\ntrain.show(2,truncate= True)\r\nCount the number of rows\r\n\r\ntrain.count()\r\nList column names\r\n\r\ntrain.columns\r\nShow mean, medium, st, etc...\r\n\r\ntrain.describe().show()\r\nShow mean, medium, st, etc... of just one column\r\n\r\ntrain.describe('kills').show()\r\nShow only certain columns\r\n\r\ntrain.select('kills','headshotKills').show(5)\r\nGet the distinct values of a column\r\n\r\ntrain.select('boosts').distinct().count()\r\ntrain.select('boosts').distinct()\r\nThat's it for now...",
        "publish": "2019-12-04",
        "slug": "how-start-new-pyspark-job-2019-12-04",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Quick Correlation Plot with Seaborn",
        "content": "Correlation is the simplest way to start comparing features to see which data points may line up with other data points.\r\n\r\nIt's fairly easy to get a quick visualization with the Pandas corr() function and a fancy Seaborn plot.\r\n\r\nThe only prerequisite is that you need to make sure all the data points in your set are numerical, either by default, or design, or elimination.  Once this has been accomplished, simply call the corr() function on your data set:\r\n\r\ncorr = df.corr()\r\nThen you can plot it.  Feel free to change the aesthetic defaults I've included here:\r\n\r\nplt.figure(figsize=(9,7))\r\nsns.heatmap(\r\n  corr,\r\n  xticklabels=corr.columns.values,\r\n  yticklabels=corr.columns.values,\r\n  linecolor='white',\r\n  linewidths=0.1,\r\n  cmap=\"RdBu\"\r\n)\r\nplt.show()\r\nAnd you'll end up with a fancy looking plot that should resemble this:\r\n\r\n\r\n\r\n \r\n\r\nPost navigation",
        "publish": "2019-11-26",
        "slug": "quick-correlation-plot-seaborn-2019-11-26",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Class Imbalance",
        "content": "This is an important concept when performing any kind of predictive analysis.  All it means is that it’s imperative that the variable you are attempting to predict has decent balance between binary values.\r\n\r\nSo if you’re attempting to predict, let’s say, cancer, your data must have a fair balance between positive cancer results and negative results.  If your data has 10 positive results and a million negatives, you will probably not be able to form a useful algorithm.\r\n\r\nLuckily, I found this little function that will go through your data and give you the balance in your data.\r\n\r\ndef print_dx_perc(data_frame, col):\r\n   dx_vals = data_frame[col].value_counts()\r\n   dx_vals = dx_vals.reset_index()\r\n   f = lambda x, y: 100 * (x / sum(y))\r\n   for i in range(0, len(dx)):\r\n      print('{0} accounts for {1:.2f}% of the diagnosis class'.format(dx[i], f(dx_vals[col].iloc[i],\r\n         dx_vals[col])))\r\n\r\nprint_dx_perc(breast_cancer, 'diagnosis')",
        "publish": "2019-11-19",
        "slug": "class-imbalance-2019-11-19",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Avocados! … and Plotly and DASH",
        "content": "Hello my pretties...\r\n\r\nI discovered Plotly and DASH, so here is my first attempt.\r\n\r\nhttps://mattocado.herokuapp.com/\r\n\r\nAs a colleague pointed out, 'holy over-plotting Batman!', which is totally correct.  I figured I would throw the kitchen sink at it and see what would happen.  Not sure it's of any practical use, but at least I know how it works now.\r\n\r\nhttps://www.kaggle.com/mattdata72/avocados-with-plotly-and-dash\r\n\r\nAlso thanks to that same colleague for introducing me to the whole architecture in the first place.  It's very cool stuff, and hopefully I will post some technical details soon!",
        "publish": "2019-11-02",
        "slug": "avocados-and-plotly-and-dash-2019-11-02",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "World War 2 Data Set",
        "content": "Since I’m a total WW2 nerd, and obviously a data one too, I found this great dataset that lists all of the weather conditions during the conflict and where it happened.\r\n\r\nI’m not sure what grander plans I have for it yet, so I started with the obvious and built this script which cleans up the VERY old dataset:\r\n\r\nhttps://www.kaggle.com/mattdata72/matt-ww2-cleanup\r\n\r\nHave fun and I hope someone can find use from it!",
        "publish": "2019-10-03",
        "slug": "world-war-2-data-set-2019-10-03",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Python List Comprehension with Plotly",
        "content": "This blog will assume you have the following skill-levels:\r\n\r\nPython: Medium\r\n\r\nPlotly: Basic\r\n\r\nI always seem to be finding bizarre corners of Python that stretch my intellectual abilities way past their natural sense of safety.\r\n\r\n\r\nHot Chick Thinking About List Comprehension\r\nThis time, I’m going to attempt to explain TWO concepts mashed into one very complicated (but powerful) tool.  Amazingly, I couldn't find this approach in any tutorial online.  Maybe I didn't look hard enough, but either way, here's my attempt.\r\n\r\nFirst let’s have a quick introduction to Python List Comprehension.  The name itself is daunting, but the concept itself is less so.\r\n\r\nList Comprehension is meant to replace a for-loop when creating new lists.  For example, a regular, straightforward way to break up a string into it’s constituent characters might go something like this:\r\n\r\nmyLetters = []\r\nfor letter in ‘giraffe’:\r\n    myLetters.append(letter)\r\nprint(myLetters)\r\nThis would create a new list (myLetters) containing this:\r\n\r\n['g','i','r','a','f','f','e']\r\nList Comprehension can do this in a slightly ‘simpler’ way.  Example:\r\n\r\nmyLetters = [ letter for letter in ‘giraffe’]\r\nprint( myLetters )\r\n['g','i','r','a','f','f','e']\r\nIt is philosophically similar to Python’s lamda operator.  You don’t HAVE to use it, but it can come in very handy. And here’s an example of when that’s true.\r\n\r\nPlotly.  It’s amazing, but it can also be bloody difficult to maneuver.  Especially when you’re as stupid as me. And by the very fact that you’re reading MY blog, I have to assume you are too.\r\n\r\nWe’re going to assume that I’m analyzing a set of data that contains a series of temperature measurements.  The set includes the day of the week that each temperature was taken as well as the time of that day.\r\n\r\nWe want to plot the data with 7 different lines, each one representing the day of the week.  The x axis will be TIME, and the y axis will be TEMPERATURE.\r\n\r\nEach day is represented many times in the data set, so we really want to split the data up by DAYS so that we can draw each line appropriately.\r\n\r\ndata = [{\r\n'x': df['TIME'],\r\n'y': df[df['DAY']==day]['AVERAGE_TEMP']\r\n} for day in df['DAY'].unique()]\r\nLet's rip apart this bizarre tiny thing and make some sense of it...\r\n\r\nFirst off, let's break out the List Comprehension and ignore the guts for now\r\n\r\ndata = [day for day in df['DAY'].unique()]\r\nIn or example, this will simply create a list of all the unique day-names:\r\n\r\n['TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY', 'SATURDAY', 'SUNDAY', 'MONDAY']\r\nNow for the middle stuff.  I haven't specified WHAT is in these data objects, but it doesn't matter in order to get the concept:\r\n\r\n'x': df['TIME'], \r\n'y': df[df['DAY']==day]['AVERAGE_TEMP']\r\nThis is essentially the first 'day' value in the List Comprehension.  For each unique 'day' value, we're going to make an x point for the times represented, and a y point for each of the temperature values.  In the above example you'll get something like the following.  Note, we get 7 coloured lines, each one representing a day in the week:",
        "publish": "2019-09-30",
        "slug": "python-list-comprehension-plotly-2019-09-30",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "DZone Love",
        "content": "Wow, I'm chuffed.  After about 20 rejection letters DZone has just published FIVE of my articles:\r\n\r\n \r\n\r\nCleaning Data 102 - Pesky Texty\r\n\r\nWhere to Start with a New Data Problem\r\n\r\nCleaning Data 101 - Imputing NULLS\r\n\r\nCoefficiently Confused\r\n\r\nCleaning Data - Supervised Styling\r\n\r\nI may not be a Voltaire yet, but it's a pretty nice honor.\r\n\r\nDZone is a very useful (and obviously highly discriminating) site for anyone working with or learning Data Science.  I highly suggest checking it out.",
        "publish": "2019-09-22",
        "slug": "dzone-love-2019-09-22",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "5 Ways that Writing a Tech Blog can Make you Smarter",
        "content": "Anyone who is reading this article has probably changed careers several times.  Sometimes because of evolutions in technology, sometimes because of industry changes, or sometimes just due to boredom or curiosity.  There are plenty of you that don’t even know you changed; remember Actionscript 2? You’re probably using Actionscript 3 now. That was a career change. A significant one. But you probably still call yourself a Flash developer.\r\n\r\nMost people on this planet don’t have jobs that need neural updating every 5 minutes.  If you work as a production supervisor at a shipyard, or as a GP in a family practice, (or any of the other professions I wish to god I’d chosen instead), your job description and responsibilities will probably evolve at a generational rate.\r\n\r\nWhen you work in tech you don’t live that dream.\r\n\r\nUnfortunately I’ve never had a learning strategy.  You could probably put the entirety of my university notes on a single sided 4x6 index card.  Professionally it evolved somewhat; I would simply commit to a job that I knew nothing about then sweat myself to the finish line (a USB powered defibrillator humming next to me).\r\n\r\nNone of this is healthy.\r\n\r\nRecently, I’ve found something much better.  It’s called blogging. You might have heard of it.  I can guarantee that this simple exercise can make you smarter.\r\n\r\nLearning\r\nDumb Kid\r\n1) Teaching Stuff is Hard\r\nExplaining a concept to another human being is difficult.  It takes patience and a willingness to answer stupid questions with a smile.  Often those stupid questions turn out to be deep and brilliant insights.\r\n\r\nWriting with an audience in mind does three things; it outs subtle details, highlights the things that you don’t completely understand yet, and reinforces the concepts that you do.\r\n\r\n \r\n\r\n2) Built-in Editors and Fact Checkers\r\n\r\n\r\nSome call them trolls, but if someone is willing to read your ramblings and then contradict or correct them don’t take it as an affront.  This is free editing my pretties. Do you know how much $$ a professional editor costs? Me neither, but it’s probably more than free. Trolls and nerdlings love to correct things, it makes them feel smart.  But at the end of the day you’re the one who will benefit.\r\n\r\n3) Cheat Sheets in Your Own Words\r\nI don’t trust myself as far as I can throw me, (I’m pretty chubby and weak)  but there are times in life when you have to trust your past self because it’s spent a considerable amount of time attempting to help your future self.\r\n\r\nThat is way too cerebral.\r\n\r\nReading an explanation of something in your own words can reinforce a concept in seconds.  Whether it’s well written or smeared on a wall in offal, they are your words and I guarantee that you will understand them.  In fact one of my most useful resources these days are my own blogs. Not because I’m amazing and self centred (although I am) but because I can immediately return a concept if I re-read my own explanation.\r\n\r\n4) Finding Colleagues.\r\nI’m not great at making friends.  I have, and have had, great friends, people who would move bodies for me.  Human ones.  But these are friends I’ve made during life, not over a coffee discussing logistic regression.  You need some nerdling friends too. You are one of them now and some of them are pretty cool.  If you have a blog you have an instant ice breaker; you’re both in the same predicament.\r\n\r\nIt also gives you street cred.  
No matter how basic or advanced your writing is, the fact that you ARE writing provides a whiff of validity; you must know what you’re talking about.\r\n\r\nThese new friends will inevitably give you tips, because nerdlings love nothing more than telling someone else how smart they are.  And free tips should never be thumbed at.\r\n\r\n5) Grammar\r\nThis may be the most important one.  Whether you’re new to English or have been speaking it your whole life, practicing writing is a must.  \r\n\r\nPeople judge, so working on your basic grammar is a big deal.  I’ve laid sweeping judgements upon people for simple grammatical errors like ‘it’s’ instead of ‘its’ or ‘wear’ instead of ‘where’.  That judgement is not fair, I admit, but it’s the reality.\r\n\r\nWe all have habitually bad grammar habits and these tend to become much clearer when you re-read your posts.  Small errors pop out like red flags. This drives me interminably crazy because I’m constantly finding and correcting small mistakes.  But at the end of the day you’re going to be better off for it.\r\n\r\n\r\n\r\nSo go.  Now. Write.  No one will make fun of you.  If anything they’ll be extremely jealous that they don’t have the guts to do the same thing.  You are under zero journalistic ethics to get everything right or perfect. You have the technological advantage of an amendable medium.  You can go back and update/fix stuff whenever you want. If you wait for the moment when your article is 100% perfect you will end up with exactly zero blog posts.  There’s nothing wrong with modifying something when you have new information come in.\r\n\r\nPosting your writing publicly will give you a sense of progress and make you realize that you are actually learning something!",
        "publish": "2019-09-07",
        "slug": "5-ways-writing-tech-blog-can-make-you-smarter-2019",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Tension with TensorFlow",
        "content": "I've just started dabbling in TensorFlow, which is an open source library for building Neural Networks (as well as other high-powered computational stuff).\r\n\r\nNeural Networks are something I learned about years ago, but of course back then it was mostly mathematical theory - we didn't really have the tools to see them work in real time.\r\n\r\nThe funnest site I've found in a long time belongs to Tensor.  It's got this amazing Neural Network Playground where you can screw around with Neurons and watch them try to figure out how to recognize an image.\r\n\r\nYou don't need to know anything to get started.  Just click the big round play button at the top, and without even choosing any particular settings you can watch it go.\r\n\r\nI strongly suggest checking this out.  I've wasted literally HOURS playing with it...",
        "publish": "2019-08-28",
        "slug": "tension-tensorflow-2019-08-28",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Vector Based Languages",
        "content": "After working in data science for a while there is one concept that I began to take for granted; Vectorization.\r\n\r\nThe term Vectorization comes from R.  It can have other names but I like Vectorization because it sounds cool.\r\n\r\nIn a normal programming language, if you want to add two arrays together it can be quite a grind.\r\n\r\nLet’s say you want to do this in regular ‘ole Python (or C or any other ‘normal’ language), you would have to build an elaborate series of for-loops, like this:\r\n\r\nd = [1,2,2,3,4]\r\ne = [4,5,4,6,4]\r\nf = []\r\nfor x in range(0, len(d)):\r\n     f.append(d[x]*e[x])\r\nprint(f)\r\n[4, 10, 8, 18, 16]\r\n\r\nMental\r\nThat’s all fine and good, but now imagine doing that with 2D matrices.  Or multiple arrays.  Or performing even more complex math on any of them.\r\n\r\nIn a Vector Based Language, you don’t have to go through that whole rigamarole.  Instead you can just do this:\r\n\r\nd = np.array([1,2,2,3,4])\r\ne = np.array([4,5,4,6,4])\r\nprint (d*e)\r\n[4, 10, 8, 18, 16]\r\nVector Based Languages let you perform mathematical functions on entire lists or matrices as though they were single objects.\r\n\r\nd = np.array([[1,2,2,3,4],\r\n[3,2,8,7,12],\r\n[11,21,26,3,43]])\r\ne = np.array([[4,5,4,6,4],\r\n[13,21,21,31,24],\r\n[51,12,22,31,46]])\r\n\r\nprint (d*e)\r\n[[   4 10    8 18 16]\r\n [  39 42  168 217 288]\r\n [ 561  252 572   93 1978]]\r\nWith a vectorized language, like R, or python with numpy, you can do these types of calculations simply and without concern about the underbelly of the process.\r\n\r\n\r\nMy Lord\r\nThank Thor for this technology. Staring at endless nested for-loops would cause me to pull my eyeballs out.\r\n\r\nAgain, I completely lost any appreciation for this important construct because getting knee deep in numpy or R will allow you to do that.  Just wait until you get back to your C programming!  Then you'll appreciate it...",
        "publish": "2019-08-06",
        "slug": "vector-based-languages-2019-08-06",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Creepy Ways to Invoke a Function in Python – Lambda",
        "content": "Whenever I begin learning a new language, I immediately get super cocky and say out loud “I know everything, I’m a genius how hard can it be!?”\r\n\r\n... before those beastly bus terminal cops come and ask me to leave.\r\n\r\nBut, as always, while snoozing under the bridge covered in my own soil, I perk up and realize that this particular language has handed me a challenge.\r\n\r\n\r\nSexy\r\nPython has a couple of very sexy ways of invoking a function.\r\n\r\nEven the name is sexy, isn't it? Python.\r\n\r\nI never digress.\r\n\r\nSure you can just define a function...\r\ndef hello(first_name, last_name):\r\nprint(\"Hello World \"+first_name+last_name) \r\nreturn\r\nThen call it...\r\nhello(“Matt”, “Hughes”)\r\nBut that's is so 90's.\r\n\r\n\r\nNot Cool\r\n\"Hey nerd, did that function come with a free Beanie Baby??\"\r\n\r\nAnd ya I know python was invented in the 50's to defeat Hitler's Spanish Armada, but sometimes old things can still seem new.\r\n\r\nUpon first exploring the lambda operator my brain contracted and shat out the word “No” several times.\r\n\r\n(Shat is not a curse word.)\r\n\r\nThe lambda operator is a quick and useful way of declaring, using and throwing away a small function...\r\n\r\nf = lambda x,y: print(“Hello World”+x+y)\r\nf(“Matt”, “Hughes”)\r\nThe general expression of a lambda function is this...\r\n\r\nlambda argument_list: expression\r\nThe lambda operator is useful when you just want a simple calculation done on a set of data ONCE. You don’t have to name it and you can use it with other operators like Map or Filter which makes it an extremely powerful tool...\r\n\r\nmyList = [0,1,2,3,4,5,6,7]\r\nresult = map(lambda x: x * 2, myList)\r\nThe above code block will apply the map operator to your little lambda function, multiplying everything in your list by 2.\r\n\r\nSome developers don't use this technique, mainly for readability reasons.  Some will claim that it is more difficult to study code like this then code with obvious declared functions.\r\n\r\nOn the other hand, many will contest that the lambda funciton is MORE readable and contructive than it's generic counterpart.  In my opinion it comes down to a situational question.\r\n\r\nA lamba call could theoretically stretch out to hundreds or even thousands of characters.  I'm not sure you're going to find anyone claiming that this makes code easier to follow.  Quite the contrary, and probably a good spot to contruct a Function in it's original intended shape.  (or an entire Class, but let's not get into that just yet.)\r\n\r\nHowever you end up using it, lamda can be a fantastic, fast (and fun) alternative.",
        "publish": "2019-07-16",
        "slug": "creepy-ways-to-invoke-a-function-in-python-lambda",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Confusion Matrix – Confused Yet?",
        "content": "Bahaha.  I love the name of this thing.  I’m sure the stats world is pulling a fast one.\r\n\r\nThis is actually not super complicated, but for some reason I can never remember which is a Type 1 Error and which is a Type 2 Error.  I suspect suspect it's because of all the fentanyl my mother did while she was pregnant.\r\n\r\nJust joking.  She's a lovely woman and never went any further than good old fashioned heroin.\r\n\r\nIf you remember anything from Uni-stats, you might remember that there are 4 types of outcomes when doing an experiment.  Two of these are correct outcomes, two are errors.\r\n\r\nTrue Positive: This is a correct outcome.  It predicts that something is TRUE when in fact it actually is TRUE.  i.e. “A cancer test comes back positive and the cancer is there”.\r\n\r\nTrue Negative:  This is also a correct outcome.  It’s when you predict something is FALSE when in fact it is actually FALSE.  i.e. “A cancer test comes back negative, and there is no cancer there”.\r\n\r\nFalse Positive:  This is an error.  We predict something is TRUE when in fact it is really FALSE.  i.e. “A cancer test comes back positive, when in fact the person doesn’t have cancer.”  Also called a Type 1 Error.\r\n\r\nFalse Negative: This is also an error.  It’s when we predict something is FALSE when it actually is TRUE.  i.e. “A cancer test comes back negative, but the person actually does have cancer.  Also called a Type 2 Error.\r\n\r\nA confusion matrix is simply a way to visualize the results of an experiment.  Example:\r\n\r\n\r\n\r\nOut of 165 test results we’ve had 150 successful ones.  50 of them were predicted to be NO and were in fact NO. 100 of them were predicted to be YES and were in fact YES.  \r\n\r\nWe’ve also had 15 Errors; 10 were Type 1 Errors and 5 were Type 2 Errors.\r\n\r\nFrom these numbers you can start to calculate a whole slew of statistics.  I won’t go over ALL of them, (you can look that up), but here are a couple:\r\n\r\nAccuracy: The number of correct results divided by the total number of results.  In this example 150/165 = 0.91 accuracy.\r\n\r\nMisclassification Rate: Essentially this is the just the exact opposite of accuracy.  15/165 = 0.09 misclassification. (You could also just figure out that this is 1 minus the accuracy rate.).\r\n\r\nAs I said, there are a slew of other stats you can come up with from a Confusion Matrix, and they are all as simple to calculate as our example.",
        "publish": "2019-07-05",
        "slug": "confusion-matrix-confused-yet-2019-07-05",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Data Visualization",
        "content": "There’s a part of me that detests data visualization.\r\n\r\nThere’s nothing wrong with it.  In fact it’s a part of this job and profoundly important to communication.\r\n\r\nIt just gets annoying dressing up my beautiful data so that some fool understand it; or should I say - LOOKS at it.\r\n\r\nHere you go sir, here’s some important data that could significantly affect the future of your company (hand puppet on), and lookit sweety, it’s dressed up with the Diamond Steel Sunset palette.  Now you're interested right?\r\n\r\nWho wipes your poopy for you?\r\n\r\n\r\nImportant Info\r\nIf you’ve ever traveled across America (and you bloody should, it's bloody amazing) and are somewhat literate you must have come across that USA Today rag.  The one that gets lovingly comped right next to the waffle maker first thing in the morning.\r\n\r\nUSA Today are the kings of deep data visualization featuring anything from Illegal Arms Sales in Africa to What Nation Eats the Most Pasta.\r\n\r\nI love these visualizations.  I'm not being sarcastic, they are genuinely very fun to pour over when you're nursing a hangover with a hot cup of brewed swill.\r\n\r\nHowever they do provide a lesson; you can make data look like anything if you douse it in enough symmetry, carbohydrates or glam.\r\n\r\nPython, (segue-way) is packed with enough visualization tools to make any product manager swoon.  So much, in fact, that I’m somewhat overwhelmed. Here is a quick list of some that I’ve been messing with.\r\n\r\n \r\n\r\nMatplotlib\r\n\r\nmatplotlib\r\nNative.  It comes with your Python distro even if you downloaded it in 2001 during the Cuban missile crisis.  It’s definitely not the prettiest, but it works great when you (a real engineer) need to know some quick answers.\r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\nPandas\r\n\r\nPandas\r\nNot quite native, but easily installed.  (plus, you should probably have Pandas included in any data problem anyway).  Ascetically it's a step up from matplotlib and is pretty easy to implement.  There are a tonne of tutorials and examples online.\r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\nSeaborn\r\n\r\nSeaborn. Cute.\r\nDefinitely the cutest of them all so far, and ridiculously simple to use.  In fact I strongly suggest checking out the Seaborn Examples Gallery.  It's rammed with visualization examples and all the code you'll need to make them happen.\r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\nR\r\n\r\nR\r\nDon't forget R.  I kind of have and I’m still not a massive fan, but the language does support all of your bog-standard plots.  I find it trickier to use than the Python alternatives, but I also thought goats were female sheep.\r\n\r\n \r\n\r\n \r\n\r\nTableau\r\nI’m not going to tackle Tableau right now; that would just continue my medicated rant...  We'll save that for another time my pretties.",
        "publish": "2019-06-24",
        "slug": "data-visualization-2019-06-24",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Data Normalization 101 – MinMaxScalar",
        "content": "As a part of any data cleanup process you’re probably going to want to normalize your data.\r\n\r\nLike in golf, if you’re playing against Tiger Woods you’ll want to give him a handicap, just so that he won’t totally blow you away and that the game might still be fun.\r\n\r\nAlso with data you may want all things equal.  \r\n\r\nAssume you’re trying to compare the height of a building and it’s internal temperature.  The two different units are fine, except when you're trying to teach a machine about the data you may want the units to have some sort of equal measure.  (300 meters vs. 26 degrees.  To us that makes sense, but the actual value of the digits is clearly incongruous.)\r\n\r\nThere are many different ways of doing this.  Personally I like the MinMaxScalar function in the sklearn package.  First off it gives you pretty straight-forward results (everything ends up being between 0 and 1), and secondly, it's the only one I know how to use so far.\r\n\r\nSo take it or leave it my pretties...\r\n\r\nLet’s make a numpy array.  Here are 6 tuples of fictional height/temperature data:\r\n\r\nbuildingData = np.array([[300,24],[200,21],[126,18],[567,27],[420,19],[189,30]])\r\nprint (buildingData)\r\n=\r\n\r\n[[300  24]\r\n [200  21]\r\n [126  18]\r\n [567  27]\r\n [420  19]\r\n [189  30]]\r\nAs I said we'll be using sklearn to do this stuff, so first you’ll need to import the MinMaxScalar function:\r\n\r\nfrom sklearn.preprocessing import MinMaxScaler\r\nThen we need to figure out the largest and smallest data point in your data set:\r\n\r\nscaler_model = MinMaxScaler()\r\nscaler_model.fit(buildingData)\r\nThen scale the data appropriately:\r\n\r\nscaled_data = scaler_model.transform(buildingData)\r\nThis will take the highest and lowest values in your data, turn them into 1 and 0 respectively, then stuff all the other values into relative numbers between 1 and zero.  So we get:\r\n\r\nprint (scaled_data)\r\n\r\n=\r\n\r\n[[ 0.39455782  0.5 ]\r\n [ 0.16780045  0.25 ]\r\n [ 0.          0. ]\r\n [ 1.          0.75 ]\r\n [ 0.66666667  0.08333333]\r\n [ 0.14285714  1. ]]\r\nVoila, all of your data is (proportionately speaking) the same as it was before, but  reducing it into relative values between 0 and 1 will allow a machine to process it appropriately.",
        "publish": "2019-06-05",
        "slug": "data-normalization-101-minmaxscalar-2019-06-05",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Kaggle",
        "content": "This site is amazing.  I might be out of the loop because I’ve only just recently discovered it.\r\n\r\nInevitably some antiseptic nerdling will scream about how behind the times I am and how she can't believe I haven't seen the new Han Solo movie yet.\r\n\r\n\r\nHave Not Seen\r\nWell, I also I just discovered the dark web too, but I’m scared of that.\r\n\r\nKaggle is the center of the Universe when it comes to learning Data Science.  First off, it’s got a DataSet section packed with stuff you can practice on (including the Titanic set we've looked at earlier) as well as related problems you can try to solve.\r\n\r\nThe exciting part however is the Competitions section.  Here you can pit your brains against other nerdlings to try and solve more complex problems.  There are even companies willing to pay some decent $$ to help out with their data.\r\n\r\n\r\nA Nerd\r\n \r\n\r\nWho wants to have a go??  I'll crush your little brains my pretties!",
        "publish": "2019-05-27",
        "slug": "kaggle-2019-05-27",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "R Comedy Graph",
        "content": "I’d completely forgotten about this.  It’s a total throwback and completely useless. \r\n\r\nDuring the previous American election, while I was learning R, I threw a cheeky info graph together to measure the humor value of each candidate, assuming that candidate won the presidential vote.\r\n\r\nTurned out I was right!  And it’s been a riot ever since.  \r\n\r\nI’m Canadian though, so I didn’t have a say.  \r\n\r\nAs a comedy fan I was just chuffed by the outcome.",
        "publish": "2019-05-02",
        "slug": "r-comedy-graph-2019-05-02",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Cleaning Data 102 – Pesky Texty",
        "content": "If you’re going to be doing any analysis or machine learning with your data, it’s very important to make sure that your data is readable by … a machine!  Imagine that. This often means getting rid of, or imputing (smart speak) any data that isn’t in a numerical format.\r\n\r\n\r\nDames. Poison.\r\nComputers love numbers.  I also love numbers. And flowery women with black glasses and a random lust for life.\r\n\r\nAlso booze.\r\n\r\nBut that is irrelevant because it turns out that computers still hate words.  And pictures. And babies.\r\n\r\nEspecially pictures of babies.\r\n\r\nSorry love.  I hate your new baby.\r\n\r\nSo in order to start, its important to try and turn your pesky letters into numbers.\r\n\r\nLuckily Python can help.\r\n\r\nQuote of the CenturyLike a lot of data science stuff, you have to make more stuff first before you can make less stuff later.\r\n\r\nThat’s a sick quote!\r\n\r\nLet’s just try dealing with Sex.  This is a good attribute to start with because it can only consist of 2 values, male or female.  Well… for the sake of this lecture let’s just make that assumption.  And don’t get all political on me my pretties.\r\n\r\nWe’ll use the Titanic data again.\r\n\r\nWhen we originally take a look at the Sex column of the data, this is what we have:\r\n\r\n0 male\r\n1 female\r\n2 female\r\n3 female\r\n4 male\r\n5 male\r\n6 male\r\n7 male\r\n8 female\r\netc...\r\n\r\nEssentially what we want to do is convert the ‘male’ and ‘female’ values  into 1’s and 0’s.  Because that’s what computers like.  \r\n\r\nThey also like taking over the Earth and destroying all life.\r\n\r\nBut that's also irrelevant.\r\n\r\nPandas is beautiful and thankfully it gives us a very simple way of doing what we need doing here with the pd.get_dummies function.\r\n\r\nWhat this function does is create a new dataset and splits all the possible values of your input data into new columns containing numerical data:\r\n\r\nsex = pd.get_dummies(titanic['Sex'],drop_first=True)\r\nIn the above example, the new sex dataset will look like this:\r\n\r\nM  | F\r\n1  | 0\r\n0  | 1\r\n0  | 1\r\netc...\r\n\r\nWe can then remove the pre-existing Sex column...\r\n\r\ntitanic.drop(['Sex'],axis=1,inplace=True)\r\n... and replace it with the new sex column of zeros and ones.\r\n\r\ntitanic = pd.concat([titanic,sex],axis=1)\r\nIf you didn't notice, you also dropped one of the two columns in the new dataset you created, because you don't need a male and female column, since the two values are mutually exclusive (You can't be male AND female.  Well... in the Titanic times you couldn't be)\r\n\r\nSo what you end up with is a new column in your main dataset called 'male' that looks like this:\r\n\r\n0 1\r\n1 0\r\n2 0\r\n3 0\r\n4 1\r\n5 1\r\n6 1\r\n7 1\r\n8 0\r\nDone like dinner....",
        "publish": "2019-04-23",
        "slug": "cleaning-data-102-pesky-texty-2019-04-23",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Machine Learning 102 – Supervised Stylingz",
        "content": "Learning! Apparently computers can do it now. Kind of.\r\n\r\nWell, not really.\r\n\r\n\r\nA Cute Ape\r\nIn fact a 4 year old can still learn much more efficiently and with a fraction of the the input. Even with an orangutan for a teacher.\r\n\r\nBut let’s not get political my pretties. Shame on you.\r\n\r\nThere are currently 3 ways to educate a computer. The first one is called Supervised Learning, and that’s what we’ll be talking about today.\r\n\r\nSupervised Learning involves taking a set of data, splitting it up (randomly) into a test set and a teaching set. The teaching set is then fed into an algorithm, which attempts to extrapolate enough data to be able to predict what’s in the testing set. If it comes out with a decent divination then the algorithm is solid.\r\n\r\n\r\nVery Simplified\r\nThis is why it’s called supervised; because we’re essentially showing the algorithm a ‘correct’ outcome, before then feeding it unknown stuff.\r\n\r\nFor now let's just look at how to set this scenario up.  The sklearn library makes all of this super easy to do. For example let’s just look at splitting up the data (step 1).\r\n\r\n\r\nAmerican House\r\nAssume that you’ve imported some US housing data and you want to train your algorithm to predict the price of a house based on several other variables. First split the dataset into two other sets, one with your prediction variable of choice (y) and the other with everything else (X)\r\n\r\n \r\n\r\nX = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]\r\ny = USAhousing['Price']\r\nNow make sure that you've imported this curious little library :\r\n\r\nfrom sklearn.model_selection import train_test_split\r\nThe train_test_split function is specifically designed to chop up your data, randomly, into 4 other datasets, both an X and y for training, and an X and y for testing.\r\n\r\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)\r\nThe test_size parameter in the above line indicates how much of your data you want to put in the test set.  I usually use around 30% (0.3) but this can be adjusted accordingly.\r\n\r\nNow fit the data:\r\n\r\nlm = linear_model.LinearRegression()\r\nmodel = lm.fit(X_train, y_train)\r\npredictions = lm.predict(X_test)\r\nYou now have this nifty little array called predictions that contains a whole bunch of information about how well your algorithm did in testing the data.\r\n\r\nWhat you do with this data is for another blog...\r\n\r\nBut in the meantime I know you're smart my pretties, so see what you can discover.",
        "publish": "2019-04-10",
        "slug": "machine-learning-102-supervised-stylingz-2019-04-1",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Where to Start with a New Data Problem",
        "content": "So I get a data file, CSV, text, etc…. and my usual first step is to stare at the file in my Downloads folder for a few minutes.  Then maybe change the file name. Then go make some coffee. Then come back and read the name of the file again. Maybe change it back.\r\n\r\nI’ll open up some IDE and make a new python file.  Save it.  Stare at that. Import some libraries… that name sucks I should change it.\r\n\r\nCNN is on, I should probably see what's happening in the world...\r\n\r\nMy point is that it’s hard to start.  And the best way to start is just to start.  Here’s a good list of things to put in your py file to at least get a handle on what you’re dealing with and hopefully get some juices flowing.\r\n\r\nImport Libraries\r\nYou might not need them all, but you can always remove them later.  This tactic is probably bad form, but I don't care, it helps...\r\n\r\nimport pandas as pd\r\nimport numpy as np\r\nimport matplotlib.pyplot as plt\r\nimport seaborn as sns\r\nfrom sklearn.model_selection import train_test_split\r\nfrom sklearn.linear_model import LinearRegression\r\nfrom sklearn.datasets import load_boston\r\nfrom sklearn import metrics\r\nImport Your Data\r\nWithout importing your data you're bound to have a tough time figuring out what you're dealing with.  And for some reason, once I've inaugurated a Panadas dataset I feel like I'm on the way...\r\n\r\ncustomers = pd.read_csv(\"Ecommerce Customers.csv\")\r\nGet Some Visualizations Going\r\nI like to start with the very basics.  Just these 4 lines will give you a tonne of information about your data and where you should start probing...\r\n\r\nprint(pdf.head())\r\nprint(pdf.info())\r\nprint(pdf.describe())\r\nprint(pdf.columns)\r\nprint(pdf.shape)\r\nprint(pdf.dtypes)\r\nPrint Some Nice Plots\r\nEveryone likes a good visualization.  It gives you a quick feeling of accomplishment and a head start toward finding gabs, dead-ends, etc...\r\n\r\nsnsData = sns.load_dataset('tips')\r\nprint(snsData.head())\r\nprint(sns.pairplot(snsData))\r\nprint(sns.distplot(snsData['some_column']))\r\nsns.heatmap(snsData.corr(), annot=True)\r\nAlthough none of these things are the answer to your underlying problem, they are a sure-fire way to get the coffee brewed, the TV turned off and your project underway.",
        "publish": "2019-04-10",
        "slug": "where-start-new-data-problem-2019-04-10",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Cleaning Data 101 – Imputing NULLs",
        "content": "Even though it seems like a bit of a grind, cleaning your data can be the most creative part of the job.\r\n\r\n\r\nFun Times\r\nIf you’re doing any sort of machine learning with your data, NULL values in your set are going to drive you mental\r\n\r\nSo, my pretties, let’s start at the beginning and impute the empty data from a set.  (For those of you not as smart as me, imputing is just a fancy way of saying ‘replace’.)\r\n\r\nI’ve been using the Titanic data, which is a fairly popular learning set, you can find it here: Titanic Data\r\n\r\nI’ve already imported the csv file:\r\n\r\ntitanic=pd.read_csv(\"titanic_train.csv\")\r\nThe first thing to do is create a nice little heat map to see where the NULLS are:\r\n\r\nsns.heatmap(titanic.isnull(),yticklabels=False,cbar=False,cmap='cubehelix')\r\nThe 'cmap' value in the above command will determine the color pallet used in your heat map.  Feel free to search for other options if you don't like the white on black.\r\n\r\n\r\n\r\nWe see in the above plot that there are several NULL values (in white).  The first one's we'll tackle will be in the Age column.\r\n\r\nThere are lots of ways to impute these values.  I've decided to find the average age of each of the 3 possible Cabin values, and apply this average to each of the missing values determined by which Cabin the missing passenger was traveling in.  The function is a bit sloppy, as was my description, but here's what I concocted:\r\n\r\ndef impute_age(cols):\r\n  Age = cols[0]\r\n  Pclass = cols[1]\r\n  ageAv = titanic.groupby('Pclass', as_index=False)['Age'].mean()\r\n  if pd.isnull(Age):\r\n    if Pclass == 1:\r\n      return ageAv.loc[0][1]\r\n    elif Pclass == 2:\r\n      return ageAv.loc[1][1]\r\n    else:\r\n      return ageAv.loc[2][1]\r\n    else:\r\n      return Age\r\nAmendum - There was an error in my the above code in my original post.  It has now been corrected.  Now I must commit ritual Seppuku to clear my family name.\r\nAnd to apply it to the Age values in your data use this:\r\n\r\ntitanic['Age'] = titanic[['Age','Pclass']].apply(impute_age,axis=1)\r\nNow your heatmap should look like this:\r\n\r\n\r\n\r\nThe next column to tackle is the Cabin value.  Since there are tonnes of NULL values in this, and since we don't really need it anyway, let's just drop the whole thing:\r\n\r\ntitanic.drop('Cabin',axis=1,inplace=True)\r\nYour plot should now look like this:\r\n\r\n\r\n\r\nThat little one remaining guy we'll just scrap too:\r\n\r\ntitanic.dropna(inplace=True)\r\nAnd voila:\r\n\r\n\r\n\r\nNo more NULL values!  And you've still got a solid set of data to use for more exciting things to come.",
        "publish": "2019-03-15",
        "slug": "cleaning-data-101-imputing-nulls-2019-03-15",
        "get_theURL": "/home/mhughes/mantmonster"
    },
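For what it's worth, the Age imputation in the post above can probably be collapsed into a couple of lines with pandas' groupby/transform. This is only a sketch of an alternative, not the function from the post; it should fill each missing Age with the mean Age of that passenger's Pclass, but check the result against your own heatmap:

import pandas as pd

titanic = pd.read_csv("titanic_train.csv")

# transform('mean') lines up the mean Age of each row's Pclass against the original
# index, and fillna() only uses that value where Age is missing.
titanic['Age'] = titanic['Age'].fillna(
    titanic.groupby('Pclass')['Age'].transform('mean')
)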
    {
        "user": 1,
        "title": "Coefficiently Confused",
        "content": "Statistics has an enduring ability to brand itself with interminable, confusing and utterly forgettable terminology.\r\n\r\n\r\nSome Gluons or Something\r\nI used to castigate physics for the same thing but at least they have an excuse; they're attempting to label billions of indescribable stuffs (i lay off Entomologists for the same reason).\r\n\r\nStatisticians don’t have “billions of stuffs” to label; maybe they have… a hundred?\r\n\r\nFor example; Linear Regression.  There’s nothing regressive about it.  There’s a line, sure, so it’s sort of linear, but… ah forget it we’ll get to that some other time...\r\n\r\nSo today, my pretties, we'll discuss the Coefficient.  An equally random term that simply means if you add 1 unit of something to x, y will change this many units.\r\n\r\nFor example, I just ran a job on some website traffic that was trying to track how much $$ people spent depending on how much time they spent on the various platforms:\r\n\r\n\r\ncoefficient\r\navg. session length\t25.957178\r\ntime on app\t38.697974\r\ntime on website\t0.039317\r\nlength of membership\t61.299257\r\n\r\nI'm the Boss.\r\n... so in this example for every unit of time a user spends on the app, they will spend an additional $38.69 (annually) at the shop.  For every unit of time spent on the website the average user will spend just $0.03.  (I can't remember what the TIME unit is in this example, but it doesn't matter the point is made).\r\n\r\nThe coefficient is definitely not the be-all-end-all stat but it's a great place to start any investigation.  It gives you a pretty good idea of where to start looking for further trends and what are dead ends.\r\n\r\n\r\nHouse. In Boston.\r\nFor more detail about doing this with Python check here.  It's using SciKit's built in Boston Housing data from 1970.  It's quite concise and the dude seems like a bit of a punk, so I'm sold.",
        "publish": "2019-02-05",
        "slug": "coefficiently-confused-2019-02-05",
        "get_theURL": "/home/mhughes/mantmonster"
    },
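A rough sketch of where a coefficient table like the one in the post above comes from in Python: fit a LinearRegression and read model.coef_. The file name and column names below are guesses based on the post's table, so rename them to match your own data:

import pandas as pd
from sklearn.linear_model import LinearRegression

customers = pd.read_csv("Ecommerce Customers.csv")

# Assumed feature and target columns; adjust to whatever your CSV actually calls them.
X = customers[['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']]
y = customers['Yearly Amount Spent']

model = LinearRegression()
model.fit(X, y)

# One coefficient per feature: how much y moves for a 1-unit bump in that feature.
for name, coef in zip(X.columns, model.coef_):
    print(name, round(coef, 6))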
    {
        "user": 1,
        "title": "Machine Learning 101",
        "content": "I'm currently tackling Machine Learning with the Python Sci-Kit package (and other libraries).\r\n\r\nI'll eventually turn this topic into a full fledged page, but for the moment I wanted to share this cheat sheet, straight from the sci-kit website.\r\n\r\nIt's a fantastic visual representation of how to go about figuring out which algorithm(s) from the (extensive) sci-kit package to use, depending on the requirements demanded by your problem.\r\n\r\nI like this visual representation a lot.  I'm looking for similar guides for other Python packages, so far I've come up empty.\r\n\r\nIf you see anything, please let me know!",
        "publish": "2019-01-29",
        "slug": "machine-learning-101-2019-01-29",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Cheat Shmeet",
        "content": "I am horrible with syntax.  Throw me a language, blind, and ask me to output Hello World, and as Thor is my witness I wouldn’t have a clue.  I could guess. And even provide a pretty good guess. But it would still be a guess.\r\n\r\nConsole.WriteLine(\"Hello, world!\");\r\nprintf(\"hello, world\\n\");\r\n10 PRINT \"Hello, World!\"\r\nDISPLAY \"Hello, world!\"\r\nPython? C? Basic? COBOL?\r\n\r\nYears ago I went for an interview at one of those archaic recruiting agencies that make you do blind programming exams. I left immediately.\r\n\r\nWhy would I spend valuable brain matter memorizing letters or numbers that don’t need to be memorized?  I don’t know my phone number. You know why? It’s on my phone.\r\n\r\nMy brain matter is mine.  If I decide to burn it on booze and drugs that's my prerogative.\r\n\r\n\r\nLame T-Shirt\r\nProgramming isn’t about rote.  It’s about problem solving.\r\n\r\nOne of my favorite analytics guys came up with a great interview method: put some punk in a room with a computer and some bog-standard coding problem to solve.\r\n\r\nThe caveat is that there is no power cable.  Unbeknownst to her, she's got 90 minutes to find power for the computer before even starting.\r\n\r\nIf you come back, 90 minutes later, and she's still sitting there baffled, you throw her to the curb.  If she's spent an hour and a half making friends around the office, bartering with the staff, and digging through trash cans to find a power, you hire her immediately.\r\n\r\nI went off track there...\r\n\r\n\r\nTiny Cheat Sheet\r\nCheat sheets.  Since my brain can only hold the very minimum of information I started writing my own cheat sheets for everything.  Eventually I realized that other, smarter people have been doing this for years.\r\n\r\nSo I'm lifting (stealing) the best ones I can find off the web with the intention of compiling them into a nifty collection on this site.\r\n\r\nAny suggestions?  Anyone out there?",
        "publish": "2019-01-07",
        "slug": "cheat-shmeet-2019-01-07",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Blog Number One",
        "content": "I’m a veteran of the IT business, going on 20+ years. I’ve developed applications and websites, managed the creation of many high profile products, and have done everything in between.\r\n\r\nI’ve now decided to throw it away and dedicate myself to the pursuit of big data.\r\n\r\nWhy?\r\n\r\nI'm still working on that answer.\r\n\r\nMaybe I should have just bought a mid-life crisis Corvette. Instead I bought a Raspberry Pi.\r\n\r\nBut in the meanwhile, I’ll be documenting the process in this Blog.\r\n\r\nI’m not sure that anyone will read this, including myself. But the exercise doesn’t require an audience. If you exercise for an audience you’re a megalomaniac.",
        "publish": "2018-12-21",
        "slug": "blog-number-one-2018-12-21",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Learnings Reals Good",
        "content": "I’ve had a problem with the Python data ecosystem ever since I started my quest.  Mainly because I’ve ignored the basics and winged it as work came in\r\n\r\nWhen I realized that I didn’t really understand what a Numpy array was vs a Pandas dataframe, I knew it was time to move backwards.\r\n\r\nThe truth of the matter is that learning stuff is 25% new and 75% going backwards.\r\n\r\nIt can be pretty bloody depressing.  But it’s super important to know that the backwards parts aren’t really steps back; they’re steps you missed along the way.\r\n\r\n\r\nJesus Baby\r\nI sound like a Christian drug Councillor.\r\n\r\nMy point is that I’m working on a Python Data Ecosystem page, written very specifically for morons like me to refer to.\r\n\r\nAny input at all would be appreciated...",
        "publish": "2018-11-27",
        "slug": "learnings-reals-good-2018-11-27",
        "get_theURL": "/home/mhughes/mantmonster"
    },
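A quick side-by-side of the NumPy-array-vs-Pandas-DataFrame thing mentioned in the post above. It's just my own five-line way of keeping the two straight, nothing official:

import numpy as np
import pandas as pd

# A NumPy array is a bare grid of numbers: positions only, no labels.
arr = np.array([[22.0, 3], [38.0, 1], [26.0, 3]])
print(arr[0, 0])                # access strictly by position

# A DataFrame wraps the same grid with named columns (and an index).
df = pd.DataFrame(arr, columns=['age', 'pclass'])
print(df['age'].mean())         # access by label, with handy methods attached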
    {
        "user": 1,
        "title": "xkcd()",
        "content": "I just found this super cool Easter Egg in Python's matplotlib library that turns all of your visualizations into cartoony versions of themselves.\r\n\r\nIt works on all the plot options offered by matplotlib but be careful - i wasted an entire afternoon playing with it.\r\n\r\nJust add it to your plot object like this:\r\n\r\nplt.xkcd()",
        "publish": "2018-10-17",
        "slug": "xkcd-2018-10-17",
        "get_theURL": "/home/mhughes/mantmonster"
    },
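A slightly fuller example of the xkcd() trick from the post above, using it as a context manager so the cartoon style only applies inside the with-block and doesn't stick around for later plots. The plotted numbers are made up:

import matplotlib.pyplot as plt

with plt.xkcd():
    # Anything drawn in here gets the hand-drawn xkcd styling.
    plt.plot([1, 2, 3, 4], [1, 4, 2, 8])
    plt.title("Very serious analysis")
    plt.show()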
    {
        "user": 1,
        "title": "The CAP",
        "content": "There are three characteristics that a database can be “goodest” at.  The three are:\r\n\r\n1) Consistency - How quickly can your dB be relied upon to show the most up to date changes.\r\n\r\n2) Availability - The reliability of your dB to respond to requests.\r\n\r\n3) Partition Tolerance - How well your dB can handle a clustered system.\r\n\r\nFor reasons that I don’t have time to explain (i.e. I don’t know) any one database can only be really good at 2 of the 3.  This makes dB selection very critical depending on the requirements of your project. It’s sort of like Heisenberg's uncertainty principle in quantum mechanics where you can’t know the position and velocity of a particle… ah never mind.\r\n\r\nSince we’re talking about big data here, number 3 (Partition Tolerance) has to be a given.  So the choice is really only between the other two.\r\n\r\nI find examples easier to grasp:\r\n\r\nFacebook: Will the world come to an end if you don’t get to see your brother’s new baby pictures for 5 seconds after they’re posted?  Probably not. So in this case one could give up a little on Consistency.\r\n\r\nE*Trade: If you buy/sell a stock online, it’s pretty imperative that that trade is successful.  In this case Availability would be critical, while Consistency (seeing the latest stock numbers immediately) could take a bit less of a priority.\r\n\r\nThe choice is yours my pretties...",
        "publish": "2018-10-02",
        "slug": "cap-2018-10-02",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "I love pi.",
        "content": "My raspberry pi came today. I love this tiny little creature, but mark my worlds, although they say it’s easy to set up it is NOT.\r\n\r\nHowever in fairness I refused to get a monitor, keyboard or mouse (for the principle, not because I’m poor) so I was going “headless”, as the nerdlings say.\r\n\r\nI literally can not be bothered to explain the process here but if anyone wants help email me and I’ll try to be useful.\r\n\r\nIt basically involves a lot of pUtty and a couple of minutes with a smart TV and an HDMI cable.\r\n\r\nNext is the pi zero. In terms of technology I’m clearly devolving. Also so is my love life.\r\n\r\nSoon I’m going to be rocking the Apollo 11 gear at Starbucks (Apollo Guidance Computer. 1024 bit core memory. F*ck I’m hawt.) How hipster would that be???",
        "publish": "2018-09-04",
        "slug": "i-love-pi-2018-09-04",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "Stats Funkin’ Blow",
        "content": "I did really well in Uni-stats.  But in fairness I was going out with a virgin-genius and so my study habits were directly correlated with my desire to invade her blonde Spark.  (see what i did there?)\r\n\r\nBut even after spending 2 months in rehab doing nothing but studying Poisson Distributions and working out with Fentanyl addicts, I found myself struggling..\r\n\r\nIt's not the math; that's elementary at best.  It's the conceptional A|B clauses that always throw me.\r\n\r\nIt's a brain f*ck.  And the older I get, although i get much sexier, my stats skills diminish.\r\n\r\nLong story short, before delving into any of this data science stuff you need to learn your stats.  Give yourself a good couple of months of dedication.\r\n\r\nEven better, get yourself arrested for some inoffensive non-violent crime and do your time in the library.",
        "publish": "2018-08-15",
        "slug": "stats-funkin-blow-2018-08-15",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "You Shouldn’t Kill Elephants … but …",
        "content": "I’ve literally spent years (maybe just months) learning and building MapReduce solutions.\r\n\r\nToday I read that it was a largely deprecated technology...\r\n\r\nLookit, I use to be a Flash developer. Before that I made a living burning Adobe Director CD-Rom apps.\r\n… so I’m used to this technology-evolution-stuff.\r\n\r\nHowever, not to sound like an old man, but: “mapreduce is an important construct to understand before tackling the principles of big data and blah, blah, blah…”\r\n\r\nBut jokes aside it’s totally true\r\n\r\nBefore l started fiddling with iPhone app development I took a whole course in C. Not C#. Or C++. Just C. That ancient language that gave the world so much. So many corny video games and antiquated databases.\r\n\r\nHaving a base is important. Especially if you think you already know it.",
        "publish": "2018-07-01",
        "slug": "you-shouldnt-kill-elephants-but-2018-07-01",
        "get_theURL": "/home/mhughes/mantmonster"
    },
    {
        "user": 1,
        "title": "aaa",
        "content": "aaaa",
        "publish": "2020-01-06",
        "slug": "aaa-2020-01-06",
        "get_theURL": "/home/mhughes/mantmonster"
    }
]