Jongyu Lin

I’ve improved back-office workflows, built websites from scratch, customized e-commerce carts, pinpointed issues with data analytics, and worked on legacy codebases. I make it a point to understand your business so that I can provide the best technical solutions. How can I help you?


Featured

Sankey Diagrams

Example Sankey Diagram

Sankey diagrams give you some general insights into your site that can lead to further analysis. In the above example, a decent percentage of the users visit the FAQ page prior to shopping the site. It’s not something I would have necessarily known about from the standard analytics: I would see traffic going to the FAQ and to shopping, but not necessarily make the connection, and I wouldn’t see it as a natural funnel. In this case, it would make sense to focus on the FAQ page and try to increase engagement from the FAQ to the cart. Disclaimer: the data is made up, but I did encounter a very similar situation in my work.

The tools that I used for this tutorial are awk, Mike Bostock’s sankey library (with example here), and D3.js. The scripts that I wrote are based on parsing data from nginx log files.

The first thing you will want to do is create a processed log file isolating the data you want. It could look something like the following:

awk '($7 !~ /^\/robots\.txt/ && $7 !~ /^\/admin/ && $7 !~ /^\/images/ && $7 !~ /^\/favicon\.ico/ && ($9 == "200" || $9 == "302")) {print $1","substr($4,2)","substr($6,2),$7}' logfile | sort -s -k1,1 > processed_logfile

The above may differ depending on the log file format. In this case, the first column is the IP address, the fourth column is the timestamp, the sixth column is the request method, and the seventh column is the URL. The command also filters out URLs that we may not want, such as static files. Finally, we sort the results by the first column (IP) only; the stable sort (-s) keeps each visitor's requests in their original order.
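For example, a standard combined-format nginx line would be reduced to something like this (the values are made up):

1.2.3.4,01/Jan/2020:13:55:36,GET /faq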

I wrote a simple Python script to transform the data so it could be used for sankey.js:

import json

def create_vertex(name):
    return {'name': name, 'edges': []}

vertices = [create_vertex('direct')]
with open('processed_logfile', 'r') as logfile:
    last_ip = None
    last_to_vertex_idx = None
    for line in logfile:
        # parse the processed log line: "IP,timestamp,method URL"
        row = line.rstrip().split(",", 2)
        ip = row[0]
        request = row[2]

        # start a new session at the root node, otherwise continue from the last page
        if ip != last_ip:
            from_vertex_idx = 0
        else:
            from_vertex_idx = last_to_vertex_idx

        to_vertex_idx = None
        for idx, vertex in enumerate(vertices):
            if request == vertex['name']:
                to_vertex_idx = idx
        if to_vertex_idx is None:
            to_vertex_idx = len(vertices)
            vertices.append(create_vertex(request))

        # see if this edge already exists
        found_edge_idx = None
        for idx, edge in enumerate(vertices[from_vertex_idx]['edges']):
            if edge['to'] == to_vertex_idx:
                found_edge_idx = idx

        if found_edge_idx is None:
            vertices[from_vertex_idx]['edges'].append({'to': to_vertex_idx, 'weight': 1})
        else:
            vertices[from_vertex_idx]['edges'][found_edge_idx]['weight'] += 1

        last_ip = ip
        last_to_vertex_idx = to_vertex_idx

print '{"nodes":['
for vertex in vertices:
    print '{"name":"%s"},' % vertex['name']
print '],'

print '"links":['
for i, vertex in enumerate(vertices):
    for edge in vertex['edges']:
        print '{"source":%d,"target":%d,"value":%d},' % (i, edge['to'], edge['weight'])
print ']}'

You can then run the script, saving its output as a JSON file to replace energy.json from the previously mentioned example.

The Python script itself is more of a starting point, as I’ve left out some of the details. For one, the script assumes all traffic is direct, but you can parse referrals to generate other root nodes (sketched below). If you run the code basically as-is, you’ll get a lot of data, including funnels that the majority of users never take. Even so, it will still give you a rough overview and let you decide which parts of the overall flow you want to focus on and explore.
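As a rough sketch of the referral idea: assume the awk command is extended to also print nginx's $http_referer as an extra comma-separated field; the host fragments and node names below are hypothetical.

# hypothetical mapping from referrer fragments to root-node names
ROOT_HOSTS = {'google.': 'search', 'bing.': 'search', 'facebook.': 'social'}

def root_name(referrer):
    # pick the session's root node based on where the visitor came from;
    # anything unrecognized (or empty) counts as direct traffic
    for fragment, name in ROOT_HOSTS.items():
        if fragment in referrer:
            return name
    return 'direct'

A new session would then start from the vertex named root_name(referrer) instead of always from vertices[0].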

Here are some other modifications you’ll most likely have to make:

Combine certain pages into a single node. Sometimes you may see nodes whose directed edges point back to each other, creating a cycle. I’ve seen this happen with informational pages where users may navigate in a non-specific order, like visiting /info, /faq, and /about versus /faq, /info, /about. While the weights would not be uniform, not much can be gleaned from these cyclical edges, especially if they make up a small percentage of traffic. In these cases, it may make sense to combine the pages into a single node.

Remove nodes and edges. With a lot of data, we usually want to tell some sort of story, and we can remove the parts that are extraneous to that story. It isn’t always about removing the least-trafficked edges or nodes, either. I’ve seen cycles or interesting outliers that indicate the user could be confused. These data points may not be a large percentage of the data, but they could indicate problems and the paths users take to resolve them. An even larger percentage of users could be affected but simply abandon the site instead of clicking through to more pages.

Defining the session. In my sample code, a session is defined per IP only, but most likely we’ll want to put in a check for a new session, either through time since the last request or a combination of URLs accessed and time (see the sketch after this list).

Remove cycles. This is more of a warning that the original sankey.js does not support cycles, but if needed, take a look at the work of Colin Fergus to add cycles to sankey.js.
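Here is a minimal sketch of the time-based check, assuming a 30-minute inactivity timeout (an arbitrary choice) and the timestamp format produced by the awk command above:

from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # arbitrary inactivity threshold

def parse_time(stamp):
    # matches the second field of the processed log, e.g. "01/Jan/2020:13:55:36"
    return datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S')

def is_new_session(ip, time, last_ip, last_time):
    # a new visitor, or the same visitor returning after a long gap
    if ip != last_ip or last_time is None:
        return True
    return time - last_time > SESSION_TIMEOUT

In the main loop, row[1] holds the timestamp, so the ip != last_ip check would become is_new_session(ip, parse_time(row[1]), last_ip, last_time).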

I hope this has been a good starting point for making sankey diagrams and drawing some insights from your data.

Featured

Here Be Monkeys

I recently had an interaction with a developer who didn't like the monkey patching pattern used on one of the websites I had created. I built the project on top of a Rails engine, which is essentially an application with its own routes, controllers, models, and views that can be mounted inside my project. I'll refer to it going forward as the Engine. While this is actually a common design pattern, I wanted to have a dialogue without falling back on that as the only justification.

The problem with monkeys

Monkey patching allows the programmer to overwrite any class at runtime. While this is quite powerful, it can cause issues because the object referenced may no longer match the original class's specification. An example would be overwriting Array#size with a completely different method definition and return type. When someone looks up the Ruby Array documentation, they would expect the original behavior, but the programmer has completely changed it. The example is rather blunt; in practice, the modifications tend to be more subtle and on methods with more complexity. Suppose I overwrite the response to a route to modify a small part of it and introduce a bug in the process. A developer down the line may spend hours looking at the non-monkey-patched version trying to figure out why the response is slightly different, not realizing the code had been modified somewhere else entirely.

Into the jungle

The issue is that I want to customize the Engine for my own application without spending too much time doing so. I could pull the code that I need to overwrite into my project. If I do this, the code will be within my application's domain and no longer a part of the Engine. Or, with monkey patching, I could go in and change what I want in the Engine itself. In this case, I tend to end up writing less code.

To illustrate why less code would be written, let's suppose I want to add a method to one of the Engine's models. Without monkey patching, I would have to create my own class within my application. Since the Engine only knows about the original model, I would then also have to pull the controller into my application to update the model reference. If instead I modify the original model to add the method directly, the Engine will already be using my new class, and I save the step of importing the controller. Even though this seems very beneficial, I already mentioned that monkey patching comes with dangers. But what if we could mitigate some of that danger?

The main problem with monkey patching is that a developer will have trouble figuring out where behavior has been modified, so let’s establish a rule: monkey patches may only exist in a predictable file structure. The structure can be similar to the hierarchies used in HMVC frameworks. I confine my patches, let’s call them overrides, to the same directory structure with _decorator appended to the filename. For example, if the Engine has app/models/post, I would write a monkey patch at app/models/post_decorator. I believe this comes out to be a good compromise. It's not a perfect solution, as there are still pitfalls. For one, it’s still less obvious and explicit than other patterns, like inheritance. And if the strategy is applied to extensions of the app that load before my own application, it’s more difficult to determine how the class is composed.

Monkeys Like Company

What I've been describing is a common pattern recommended by different Rails engines, such as Refinery CMS. Since it's a common strategy, I believe others have weighed the pros and cons and arrived at the same conclusion that I did. Other people may still arrive at a different answer. I think it’s always good to weigh the pros and cons of strategies used in a codebase, and it’s not always a simple case of never ever doing something because it’s not a best practice.

Featured

Digging into ActiveRecord and PostgreSQL Enums

I came across an interesting problem in one of my ActiveRecord models (paraphrased, this isn’t the exact model):

class Event < ActiveRecord::Base
  attr_accessible :certainty
  validates :certainty, :inclusion => {
    :in => %w(less neutral more),
    :message => "%{value}"
  }
end

The problem was that I would set certainty to one of the accepted values, let’s say 'less', and the form would still wind up throwing a validation error. I overrode the default error message just to retrieve the value, and it turned out the value for certainty was 0.

The reason this was happening is that certainty is defined as a PostgreSQL enumerated type:

CREATE TYPE certainty AS ENUM ('less', 'neutral', 'more');

and these are the type detection methods in ActiveRecord:

From lib/active_record/connection_adapters/postgresql_adapter.rb:

def simplified_type(field_type)
  case field_type
  # Numeric and monetary types
  when /^(?:real|double precision)$/
    :float
  # Monetary types
  when 'money'
    :decimal
  when 'hstore'
    :hstore
  # Network address types
  when 'inet'
    :inet
  when 'cidr'
    :cidr
  when 'macaddr'
    :macaddr
  # Character types
  when /^(?:character varying|bpchar)(?:\(\d+\))?$/
    :string
  # Binary data types
  when 'bytea'
    :binary
  # Date/time types
  when /^timestamp with(?:out)? time zone$/
    :datetime
  when 'interval'
    :string
  # Geometric types
  when /^(?:point|line|lseg|box|"?path"?|polygon|circle)$/
    :string
  # Bit strings
  when /^bit(?: varying)?(?:\(\d+\))?$/
    :string
  # XML type
  when 'xml'
    :xml
  # tsvector type
  when 'tsvector'
    :tsvector
  # Arrays
  when /^\D+\[\]$/
    :string
  # Object identifier types
  when 'oid'
    :integer
  # UUID type
  when 'uuid'
    :uuid
  # Small and big integer types
  when /^(?:small|big)int$/
    :integer
  # Pass through all types that are not specific to PostgreSQL.
  else
    super
  end
end

and lib/active_record/connection_adapters/column.rb:

def simplified_type(field_type)
  case field_type
  when /int/i
    :integer
  when /float|double/i
    :float
  when /decimal|numeric|number/i
    extract_scale(field_type) == 0 ? :integer : :decimal
  when /datetime/i
    :datetime
  when /timestamp/i
    :timestamp
  when /time/i
    :time
  when /date/i
    :date
  when /clob/i, /text/i
    :text
  when /blob/i, /binary/i
    :binary
  when /char/i, /string/i
    :string
  when /boolean/i
    :boolean
  end
end

The field_type in the above method calls is 'certainty'. The simplified_type in PostgreSQLColumn doesn’t match on any of the cases, so the call gets passed to the parent. In the parent, Column, 'certainty' matches against 'when /int/i', because the string contains the substring 'int' (certa-int-y), and returns :integer as the type. Once :integer is set as the type, ActiveRecord does its thing and converts the attribute to an integer prior to a save. The validation then triggers on the changed value.

I couldn’t figure out where the variable actually gets type checked and converted. I tried modifying the column type directly in the model, but that didn’t prevent the conversion. My solution was to simply ALTER TYPE certainty RENAME TO certain, which works because 'certain' no longer contains the substring 'int'. PostgreSQL cascaded the changes and everything ended up being okay.

I want to thank Joe for taking the lead and filing a bug report here: https://github.com/rails/rails/issues/7814.

Featured

Placemixer: UI Experiments in Maps

Back in 2008, I worked on a concept called Placemixer, an itinerary planning web application.

We wanted to create a map application with an emphasis on relevant points of interest, such as airports, hotels, and restaurants. We believed that being able to quickly locate these points of interest while traveling would be really helpful. Users could then easily add these points to their itinerary.

Screenshot of main map interface

The itinerary has two types of entries: regular and time-sensitive. Time-sensitive entries occur at a specific time, such as catching a show on Broadway or making a restaurant reservation. Since these entries don't have as much flexibility in time, regular entries would be grouped by distance to time-sensitive ones. The idea is to create options through flexibility in the itinerary.


Screenshot of the itinerary sidebar

Gravitate is an extension of the above idea. Clicking Gravitate for a place zooms the map to show all the locations closest to the selection. For the prototype, we applied the feature only to hotels, because it made a lot of sense for people to plan their hotel options based on proximity to the things they want to do.

Once the user was done, a printable version of the itinerary was offered, featuring directions, the itinerary listing, and points near each item.

Preview of Generated Itinerary

Looking at it now, I would make a few changes to the interface. I would move an itinerary timeline to the top for pagination and create another screen with small versions of the maps corresponding to each itinerary entry.

Mockup for timeline view

The entire concept was targeted towards traveling to places foreign to the user. The user may not know where the closest airports are, or where certain landmarks sit relative to each other. For these users, a printed itinerary would also come in handy in case Internet access was unavailable.

I believe that technology will bring about change in how we plan trips. TripIt is a perfect example of a great application simplifying complex travel arrangements. Augmented reality and mobile Internet access have become far more widespread in the past few years. Placemixer, in its last incarnation, was targeted towards power users who want an interface to plan out trips. I don’t know if there’s a market for such an application. I instead believe editorially created travel guides or quick auto-planning guides combined with location-aware applications are the future of travel. This satisfies two core ideas: expedient planning and adaptability.

Featured

Getting Started with Map Tiling: Mapnik and Shapefiles

About a year ago, I was working on launching a website with a map interface. I wanted full control over everything, including generating my own map tiles instead of using a third-party provider like Google Maps. It ended up being a bad idea, because map tiles take a lot of resources to generate and store. Still, going through the process gave me a much better understanding of map tiles, and I hope this will serve as a step-by-step guide showing how to generate tiles from start to finish.

My environment is Ubuntu 8.10. To start with, I installed the following packages from the Terminal:

sudo apt-get install postgresql-8.3-postgis
sudo apt-get install python-mapnik
sudo apt-get install libmapnik-dev
sudo apt-get install imagemagick

The next thing I needed was data, and I was able to obtain and play around with a San Francisco sample shapefile from Navteq.

A few important things should be noted about shapefiles. A shapefile is actually multiple files (.sbn, .shx, .shp, .dbf) connected by their filename. A shapefile is what the name describes: a file containing data for a bunch of shapes. For example, it may contain data that says to draw a line from one geospatial coordinate (latitude, longitude) to another, or data that says to draw a polygon with vertices at different coordinates, or a combination of all of these.
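If you want to peek inside a shapefile programmatically, here is a minimal sketch using the GDAL/OGR Python bindings (an assumption on my part: the python-gdal package is installed, and the filename matches the Streets shapefile used later):

from osgeo import ogr

ds = ogr.Open('Streets.shp')
layer = ds.GetLayer(0)
print(layer.GetFeatureCount())  # number of shapes in the file

feature = layer.GetNextFeature()
geom = feature.GetGeometryRef()
print(geom.GetGeometryName())   # e.g. LINESTRING for street segments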

Each shapefile defines its own map, and for complicated maps, we typically need to overlay multiple shapefiles. One shapefile may define the city boundaries while another shapefile has all the streets. If we wanted both city boundaries and streets, we would overlay one shapefile on top of the other.

Let's look at using Mapnik to interpret these shapefiles. The hard work will be done by generate_tiles.py, which I'll set up by running the following in the Terminal from my home directory:

mkdir mapnik
cd mapnik
svn export http://svn.openstreetmap.org/applications/rendering/mapnik/generate_tiles.py ./generate_tiles.py

If you're not using the same directory structure as me (home_directory/mapnik), there will be some differences later on, but everything should still work.

This file will not be immediately ready to work with my dataset. First, there are pre-populated test cases starting on line 123:

# World
bbox = (-180.0, -90.0, 180.0, 90.0)

render_tiles(bbox, mapfile, tile_dir, 0, 5, "World")

minZoom = 10
maxZoom = 16
bbox = (-2, 50.0, 1.0, 52.0)
render_tiles(bbox, mapfile, tile_dir, minZoom, maxZoom)

And so on… The first thing I did was wipe out all the code from line 123 on and replace it with the following:

bbox = (-122.45, 37.76, -122.4, 37.8)
render_tiles(bbox, mapfile, tile_dir, 15, 16, "SF")

Keep in mind that the sample data I requested from Navteq covered the SF region. If your data only corresponds to a certain region, entering a bounding box (bbox) outside of the coverage will result in a bunch of empty images. You can check the shapefile with software like QGIS to make sure the data matches what you think.
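Alternatively, the gdal-bin package (an extra install, not part of the list above) includes ogrinfo, whose summary mode prints each layer's extent:

sudo apt-get install gdal-bin
ogrinfo -so Streets.shp Streets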

I now need to create an XML mapfile. The mapfile defines the data source and how to visualize the data, and it will be used by generate_tiles.py to generate the map. I created mapfile.xml and put in the following:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE Map>
<Map bgcolor="#f2eff9" srs="+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +no_defs +over">
  <Style name="StreetStyle">
    <Rule>
      <LineSymbolizer>
        <CssParameter name="stroke">#000000</CssParameter>
        <CssParameter name="stroke-width">0.1</CssParameter>
      </LineSymbolizer>
    </Rule>
    <Rule>
      <TextSymbolizer name="ST_NAME" face_name="DejaVu Sans Book" size="9" fill="black" halo_fill="#DFDBE3" halo_radius="1" wrap_width="20" spacing="5" allow_overlap="false" avoid_edges="false" min_distance="10" placement="line" />
    </Rule>
  </Style>

  <Layer name="Streets" srs="+proj=latlong +datum=WGS84">
    <StyleName>StreetStyle</StyleName>
    <Datasource>
      <Parameter name="type">shape</Parameter>
      <Parameter name="file">Streets</Parameter>
    </Datasource>
  </Layer>
</Map>

Finally, I replace lines 109 - 116 in generate_tiles.py:

try:
    mapfile = os.environ['MAPNIK_MAP_FILE']
except KeyError:
    mapfile = home + "/svn.openstreetmap.org/applications/rendering/mapnik/osm-local.xml"
try:
    tile_dir = os.environ['MAPNIK_TILE_DIR']
except KeyError:
    tile_dir = home + "/osm/tiles/"

with my own mapfile and the tile directory set up earlier:

mapfile = home + "/mapnik/mapfile.xml"
tile_dir = home + "/mapnik/tiles/"

Run ./generate_tiles.py and the map tiles should show up in the tiles sub-directory.
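Before committing to a full tileset, it can help to sanity-check the mapfile by rendering a single image with the Mapnik Python bindings. This is a minimal sketch, assuming the newer bindings (older versions spell Box2d as Envelope); the corner coordinates match the SF bounding box above:

import mapnik

m = mapnik.Map(800, 600)
mapnik.load_map(m, 'mapfile.xml')

# the map projection is spherical Mercator, so project the lat/lon corners
merc = mapnik.Projection(m.srs)
sw = merc.forward(mapnik.Coord(-122.45, 37.76))
ne = merc.forward(mapnik.Coord(-122.4, 37.8))
m.zoom_to_box(mapnik.Box2d(sw.x, sw.y, ne.x, ne.y))

mapnik.render_to_file(m, 'test.png', 'png')

If test.png comes out blank, the usual suspects are the bounding box, the srs attributes, or the shapefile path.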

Other than the bounding box, which I discussed before, the only thing that changes for a new dataset is the xml file. Let's walk through it, so that we can generate an xml mapfile for any data source, even one different from Navteq shapefiles.

I would leave the first lines the same and start from the first interesting bit:

<Style name="StreetStyle">
  <Rule>
    <LineSymbolizer>
      <CssParameter name="stroke">#000000</CssParameter>
      <CssParameter name="stroke-width">0.1</CssParameter>
    </LineSymbolizer>
  </Rule>
  <Rule>
    <TextSymbolizer name="ST_NAME" face_name="DejaVu Sans Book" size="9" fill="black" halo_fill="#DFDBE3" halo_radius="1" wrap_width="20" spacing="5" allow_overlap="false" avoid_edges="false" min_distance="10" placement="line" />
  </Rule>
</Style>

A style is applied to a particular shapefile, or other data source. Each style can be composed of different rules. Let's start with the first rule in the above snippet: LineSymbolizer. Remember when I described shapefiles? They can contain lines, polygons, and so on. The Streets shapefile is primarily made up of lines. We must then create a style for these lines in order to see them in the final tiles. Shapes without a style applied will not be seen.

The next rule is the TextSymbolizer. Text for the shapefiles lives in the .dbf file, which can be viewed with a DBF viewer. The file is a simple table mapping text to the shapes. The name attribute with the value ST_NAME is what links the TextSymbolizer to the .dbf file: if we opened Streets.dbf, we would see a column called ST_NAME with each row corresponding to a street name. If you aren't using the same shapefile as I am, consult the .dbf file to see what text can be placed on your map.

I glossed over the specific styling rules because I found them to be quite straightforward. For a complete list and explanation, bookmark Cascadenik.

The final part of my xml file is where we define the data source and set the style for that source:

<Layer name="Streets" srs="+proj=latlong +datum=WGS84">
  <StyleName>StreetStyle</StyleName>
  <Datasource>
    <Parameter name="type">shape</Parameter>
    <Parameter name="file">Streets</Parameter>
  </Datasource>
</Layer>

I think the above is straightforward. I defined a layer, told it to use the style created above, and pointed the data source at a shapefile named Streets. The only confusing bit I haven't gone over yet is the srs attribute on the layer. Most shapefiles I've encountered use the WGS84 datum, but if the tiles aren't coming out, this may be the cause. To verify, open the shapefile with a program like QGIS and check the default projection being used.

I'm not familiar enough with the different projections to be able to comfortably write about them in any detail. If the bounding box is correct, and the xml is correct, then the next most likely error is the projection settings being used. You may have noticed this line as well:

+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +no_defs +over

The above line is placed in the code to specify the projection of the tiles. In this case, I'm copying the very common method (at least for Internet maps) of using a Mercator projection that treats the Earth as a perfect sphere with a radius of 6,378,137 meters; the Earth is actually an imperfect ellipsoid with distinct major and minor axes. Projecting onto a square simply makes things easier, and the distortions are not significant enough to adversely affect things like driving directions and points of interest. Since the source data can be in a different coordinate system, such as plain latitude/longitude on the WGS84 datum, we must specify both the source projection and the final projection, or Mapnik won't know how to render the tiles.
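To make the projection concrete, here is the standard spherical Mercator formula as a small Python sketch, using the same radius as the srs string above:

import math

R = 6378137.0  # sphere radius in meters, matching +a and +b above

def mercator(lon_deg, lat_deg):
    # project longitude/latitude in degrees to x/y in meters
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

print(mercator(-122.4, 37.8))  # roughly (-13625500, 4552000) for San Francisco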

I know my description of projections may not make too much sense, but I'm trying to cover a lot of ground as quickly as possible. It's actually quite an important topic deserving of its own article written by someone whose expertise is greater than mine. The main take-away is that the map being generated is not a perfect representation of the Earth, but it works for most of the typical purposes used in a service like Google Maps. If you need a more accurate picture of the Earth, it would be wise to consult the different projections.

For more reading on projections, I would recommend Charlie Savage's blog.

I hope it's easy to see that from these simple beginnings, we can build more complicated maps. We can add additional layers and styles on top of what we've defined above. Each layer can correspond to a different shapefile: we might have a shapefile for points of interest, or one with census data. We can then add more details and information to our maps.

I hope it's also easy to see that creating a custom set of map tiles is no easy task. This is the main reason I abandoned the idea of making a custom set of tiles for my application. If you don't need a custom tileset, then don't go down this road. Beyond the sheer processing power to generate all the tiles needed, you also have to worry about the bandwidth needed to transmit these tiles. There's a reason why sites are using Google Maps and not creating their own tiles. It's a hassle to do the latter. But there are situations where Google Maps falls short. If you need more accurate maps for certain situations, different kinds of detailing on the tile, or are trying to compete directly as a map tile server, then you'll definitely need to look into creating your own tiles.