Jongyu Lin

I work on websites, apps, and analytics.


Rikaichan Japanese English Dictionary with Romaji

Posted on .

I’ve been a long-time user of the Rikaichan Firefox extension, but I’ve always wished it showed romaji. I decided to release my own dictionary with romaji included. Since I didn’t want to modify the Rikaichan extension itself, I had to insert the romaji into the definition text. The romaji precedes the definition in parentheses.

You can install it through this link:

If you’re using Firefox, you will be prompted for permission to install.

If you aren’t using Firefox, the file will download and you can drag it into the Firefox window to install.

You can find the source here:


Multiple IPs on Amazon EC2 Debian AMI

Posted on .

For an Amazon EC2 instance, IPs are associated through a network interface. You can have multiple IPs per interface and multiple interfaces per instance, but you are limited depending on the instance type. For example, a t2.micro instance can have 2 interfaces and 2 IPs per interface.

The easiest way to add an IP is to add it to the existing network interface, but when you max out the number of IPs on the interface, the next IP will have to be added to a new interface. I’ll go over the process as if I’m adding IPs to a t2.micro instance. I believe once you see the process for adding the first four IPs to such an instance, you should also be able to see a pattern emerge that should work for any number of interfaces or IPs.

Note that I will only go over adding the private IP to the instance. Once the private IP is set up, the associated public IP, if there is one, should just work.

All commands will probably need to be run with root privileges.

Let’s refer to the current primary private IP associated with the instance as x1.x1.x1.x1.

Adding the Second IP

In the Amazon EC2 console, go to the Network Interfaces page. Select the interface associated with the instance. There should only be one at this point. Click “Assign new IP” and then “Yes, Update.” You should now see the new IP listed under Secondary private IPs. Make a note of this IP. We’ll refer to it going forward as x2.x2.x2.x2. Also make a note of the subnet mask in the line at the top that looks like “eth0 - eni… - ip.ip.ip.ip/s”. You’re interested in the /s part of this line.

Now ssh into your server. Simply run:

# ip addr add dev eth0 x2.x2.x2.x2/s

That’s it! The IP should now work.

To persist the change through a reboot, edit /etc/network/interfaces:

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet dhcp
post-up ip addr add dev eth0 x2.x2.x2.x2/s

Adding the Third IP

Since we’ve maxed out the IPs for the first network interface on our micro instance, we’ll need to create a new network interface. This can be done under Network Interfaces in AWS.

Once you’ve created a new interface, a private IP should already be associated with it. Let’s refer to this IP as x3.x3.x3.x3.

Add the new interface to /etc/network/interfaces:

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet dhcp
post-up ip addr add dev eth0 x2.x2.x2.x2/s

auto eth1
iface eth1 inet dhcp

Restart networking by running this on the shell:

# /etc/init.d/networking restart

Load the current routing data by running this on the shell:

# ip route show

The response will look something like this:

default via g.g.g.g dev eth0
<some_ip>/<some_number> dev eth0  scope link
<some_ip>/<some_number> dev eth0  proto kernel  scope link  src x1.x1.x1.x1
<some_ip>/<some_number> dev eth1  proto kernel  scope link  src x3.x3.x3.x3

The only data we care about is g.g.g.g in the “default via” line. Make a note of it.

Run the following in the shell:

# ip route add default via g.g.g.g dev eth0 tab 1
# ip rule add from x1.x1.x1.x1/32 tab 1 priority 500
# ip route add default via g.g.g.g dev eth1 tab 2
# ip rule add from x3.x3.x3.x3/32 tab 2 priority 600

The third IP should now be active.

Persist the changes by editing /etc/network/interfaces:

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet dhcp
post-up ip route add default via g.g.g.g dev eth0 tab 1
post-up ip rule add from x1.x1.x1.x1/32 tab 1 priority 500
post-up ip addr add dev eth0 x2.x2.x2.x2/s

auto eth1
iface eth1 inet dhcp
post-up ip route add default via g.g.g.g dev eth1 tab 2
post-up ip rule add from x3.x3.x3.x3/32 tab 2 priority 600

Adding the Fourth IP

At this point, you should be getting the hang of things. Add an additional private IP to the second network interface in the Amazon EC2 console. This will be similar to how you added the second IP to the first network interface.

On the server, run the following in the shell:

# ip addr add dev eth1 x4.x4.x4.x4/s
# ip rule add from x4.x4.x4.x4/32 tab 2 priority 700

and, similarly, update /etc/network/interfaces:

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet dhcp
post-up ip route add default via g.g.g.g dev eth0 tab 1
post-up ip rule add from x1.x1.x1.x1/32 tab 1 priority 500
post-up ip addr add dev eth0 x2.x2.x2.x2/s

auto eth1
iface eth1 inet dhcp
post-up ip route add default via g.g.g.g dev eth1 tab 2
post-up ip addr add dev eth1 x4.x4.x4.x4/s
post-up ip rule add from x3.x3.x3.x3/32 tab 2 priority 600
post-up ip rule add from x4.x4.x4.x4/32 tab 2 priority 700

Final Notes

Whew, we’ve now gotten all our IPs associated with the instance. I wanted to make a quick note that while I prefer this order of adding IPs, there aren’t any set rules. You could, for example, add the second IP by creating a new network interface and following the steps for adding the third IP in this post.



Taipei’s bus system

Posted on .

Made with Leaflet.js and Photoshop


Sankey Diagrams

Posted on .

Example Sankey Diagram

Sankey diagrams give you some general insights into your site that can lead to further analysis. In the above example, a decent percentage of the users visit the FAQ page prior to shopping the site. It’s not something I would have necessarily known about based on the standard analytics. I would see traffic going to the FAQ and to shopping, but not necessarily make the connection. I also wouldn’t see it as a natural funnel. In this case, it would make sense to focus on the FAQ page and try to increase engagement from the FAQ to utilizing the cart. Disclaimer: the data is made up, but I did encounter a very similar situation in my work.

The tools that I used for this tutorial are awk, Mike Bostock’s sankey library (with example here), and D3.js. The scripts that I wrote were based on parsing data from nginx log files.

The first thing you will want to do is create a processed log file isolating the data you want. It could look something like the following:

awk '($7 !~ /^\/robots.txt/ && $7 !~ /^\/admin/ && $7 !~ /^\/images/ && $7 !~ /^\/favicon.ico/ && ($9 == "200" || $9 == "302")) {print $1, substr($4,2), substr($6,2), $7}' logfile | sort -s -k1,1 > processed_logfile

The above may be different depending on the log file format. In this case, the first column was the IP, the fourth column was the timestamp, the sixth column was the request type, and the seventh column was the URL. The command also filters out URLs that we may not want, such as static files, and keeps only requests that returned a 200 or 302. Finally, we sort the results by the first column (IP) only.
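If awk isn’t your tool of choice, here’s a rough Python equivalent of that filter (a sketch assuming the common nginx combined log format; your field positions may differ):

```python
import re

# URL prefixes to skip, mirroring the awk filter above
SKIP = re.compile(r'^/(robots\.txt|admin|images|favicon\.ico)')

def process_line(line):
    """Return 'ip timestamp method url' for lines worth keeping, else None."""
    fields = line.split()
    ip = fields[0]
    timestamp = fields[3].lstrip('[')  # strip the leading '[' from [10/Oct/...
    method = fields[5].lstrip('"')     # strip the leading '"' from "GET
    url = fields[6]
    status = fields[8]
    if status not in ('200', '302') or SKIP.match(url):
        return None
    return ' '.join((ip, timestamp, method, url))
```

The `sort -s -k1,1` step then becomes a stable `sorted(kept_lines, key=lambda l: l.split()[0])`.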

I wrote a simple Python script to transform the data so it could be used for sankey.js:

def create_vertex(name):
    return {'name': name, 'edges': []}

vertices = [create_vertex('direct')]
with open('processed_logfile', 'r') as logfile:
    last_ip = None
    last_to_vertex_idx = None
    for line in logfile:
        # set variables
        row = line.split(" ")
        ip = row[0]
        request = row[2] + " " + row[3].rstrip()

        # new visitor: start from the root; otherwise chain from the last vertex
        if ip != last_ip:
            from_vertex_idx = 0
        else:
            from_vertex_idx = last_to_vertex_idx

        to_vertex_idx = None
        for idx, vertex in enumerate(vertices):
            if request == vertex['name']:
                to_vertex_idx = idx
        if to_vertex_idx is None:
            to_vertex_idx = len(vertices)
            vertices.append(create_vertex(request))

        # see if this edge already exists
        found_edge_idx = None
        for idx, edge in enumerate(vertices[from_vertex_idx]['edges']):
            if edge['to'] == to_vertex_idx:
                found_edge_idx = idx

        if found_edge_idx is None:
            vertices[from_vertex_idx]['edges'].append({'to': to_vertex_idx, 'weight': 1})
        else:
            weight = vertices[from_vertex_idx]['edges'][found_edge_idx]['weight']
            vertices[from_vertex_idx]['edges'][found_edge_idx]['weight'] = weight + 1

        last_ip = ip
        last_to_vertex_idx = to_vertex_idx

print('{"nodes":[%s],' % ','.join(
    '{"name":"%s"}' % vertex['name'] for vertex in vertices))

links = []
for i, vertex in enumerate(vertices):
    for edge in vertex['edges']:
        links.append('{"source":%d,"target":%d,"value":%d}'
                     % (i, edge['to'], edge['weight']))
print('"links":[%s]}' % ','.join(links))

You can then run the above to generate a json file to replace energy.json from the previously mentioned example.
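For reference, the output needs the same shape as energy.json: a nodes array of name objects and a links array whose source and target index into it. A quick sanity check with a made-up two-node graph:

```python
import json

# a made-up two-node graph in the format sankey.js expects
output = ('{"nodes":[{"name":"direct"},{"name":"GET /faq"}],'
          '"links":[{"source":0,"target":1,"value":42}]}')

data = json.loads(output)  # raises ValueError if the emitted JSON is invalid
assert set(data.keys()) == {'nodes', 'links'}
assert {'source', 'target', 'value'} <= set(data['links'][0].keys())
```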

The python script itself is more of a starting point as I’ve left out some of the details. For one, the script assumes all traffic is direct, but you can parse referrals to generate other root nodes. If you run the code basically as-is, you’ll get a lot of data, including funnels that the majority of users never take. Even so, it will still give you a rough overview, and let you decide which parts of the overall flow you want to focus on and explore.

Here are some other modifications you’ll most likely have to make:

Combine certain pages into a single node. Sometimes you may see nodes where the directed edges point back to each other creating a cycle. I’ve seen this happen with informational pages where users may navigate in a non-specific order, like visiting /info, /faq, and /about versus /faq, /info, /about. While the weights would not be uniform, not much can be gleaned from these cyclical edges, especially if it makes up a small percentage of traffic. In these cases, it may make sense to combine the pages into a single node.

Remove nodes and edges. With a lot of data, we usually want to tell some sort of story and we can remove the parts that are extraneous to that story. It isn’t always about removing the least trafficked edges or nodes either. I’ve seen cycles or interesting outliers that indicate the user could be confused. These data points may not be a large percentage of the data, but could indicate problems and the paths users take to resolve them. An even larger percentage could be affected but they just abandon the site instead of clicking through to more pages.

Defining the session. In my sample code, a session is only defined per user (IP), but most likely we’ll want to put in a check for a new session, either through time since the last request or a combination of URLs accessed and time.

Remove cycles. This is more of a warning that the original sankey.js does not support cycles, but if needed, take a look at the work of Colin Fergus to add cycles to sankey.js.
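As a sketch of the session point above, time-based session splitting might look like this (assuming requests are already grouped per IP and time-ordered; the 30-minute cutoff is a common but arbitrary choice):

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed cutoff; tune for your site

def split_sessions(requests, gap=SESSION_GAP):
    """Split one visitor's time-ordered (timestamp, url) requests into
    sessions, starting a new session whenever the idle gap exceeds `gap`."""
    sessions = []
    last_time = None
    for ts, url in requests:
        if last_time is None or ts - last_time > gap:
            sessions.append([])  # idle too long: begin a new session
        sessions[-1].append(url)
        last_time = ts
    return sessions
```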

I hope this has been a good starting point for making sankey diagrams and drawing some insights into your data.


Here Be Monkeys

Posted on .

I recently had an interaction with a developer who didn’t like the monkey patching pattern used on one of the websites I had created. I built the project on top of a Rails engine, which means it’s an application with its own routes, controllers, models, and views. I’ll refer to it going forward as the Engine. While this is actually a common design pattern in the Rails community, I wanted to try to have a dialogue without falling back on that as the only reason.

The problem with monkeys

Monkey patching allows the programmer to overwrite any class at runtime. While this is quite powerful, it can cause issues because the object referenced may no longer represent the original class’s specifications. An example would be overwriting Array#size to have a completely different method definition and return type. When someone looks up the Ruby Array documentation, they would expect it to behave in the original manner but the behavior is now unexpected because the programmer has completely changed it. In practice, the modifications would be more subtle and on methods with more complexity. Let’s say I overwrite the response to a route to modify a small part of the response and introduce a bug. A developer down the line may spend hours looking at the non monkey-patched version trying to figure out why the response is slightly different not realizing the code had been tampered with somewhere else. Monkey patching makes things less explicit, which is not good.
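The mechanics aren’t unique to Ruby. Here’s a toy illustration in Python, with a made-up Greeter class, showing how rebinding a method silently changes behavior for every caller:

```python
class Greeter:
    """A made-up class standing in for Array in the example above."""
    def greet(self):
        return "hello"

g = Greeter()
assert g.greet() == "hello"

def loud_greet(self):
    return "HELLO!"

# the monkey patch: rebinding the method changes behavior for every
# instance, existing or future, with no visible change at the call sites
Greeter.greet = loud_greet
assert g.greet() == "HELLO!"
```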

Into the jungle

The issue is that I want to customize the Engine for my own application and not have to spend too much time doing so. I could pull in the code that I need to overwrite and make it part of my project. If I do this, the code will be within my application’s domain and integrating changes upstream would be more difficult. I could also modify the engine itself, but I would then be maintaining a separate codebase. Or with monkey patching, I could modify only parts of the Engine that I want within my own codebase. This actually minimizes the amount of code written and thus, the amount of code I have to maintain.

To illustrate why less code would be written, let's suppose I want to add a method to one of the Engine's models. Without monkey patching, I would have to create my own class within my application. Since the Engine only knows about the original model, I would then be required to also pull in the controller into my application to update the model reference. If instead I modify the original model to add the method directly, the Engine will already be using my new class. I can save the step of importing the controller. Even though this seems very beneficial, I already mentioned that monkey patching comes with dangers, but what if we could mitigate some of that danger?

The main problem with monkey patching is that a developer will have trouble figuring out where behavior has been modified, so let’s establish a rule that monkey patches can only exist in a predictable file structure. The structure can be similar to the hierarchies used in HMVCs. I confine my patches, let’s call them overrides, to the same directory structure with _decorator appended. For example, if the Engine has app/models/post, I would write a monkey patch at app/models/post_decorator. I believe this comes out to be a very good compromise. It’s not a perfect solution, as there are still pitfalls that can occur. For one, it’s still less obvious and explicit than other patterns, like inheritance. And if the strategy is applied to extensions of the app that load before my own application, it’s more difficult to determine how the class is composed.

There’s another reason for the use of monkey patching in the Rails ecosystem, and that’s the ability for multiple gems to modify each other. For example, suppose there’s a gem A, and gems B and C both need to modify functionality in A, but B and C don’t know about each other. With monkey patching, all the gems can just work, and having things just work is very much a part of the Rails philosophy.


What I've been describing is a common pattern recommended by different Rails engines, such as Refinery CMS. I believe many people have decided that this pattern is the right way to go, especially given the Rails philosophy.

Personally, I prefer to start here and end up somewhere in the middle. I’ll take advantage of engines and use the suggested pattern to override behavior, but I’ll also slowly import more and more functionality into my own codebase, especially things closer to core functionality that I’m likely to customize quite a bit. In my opinion, the pattern falls under the same idea as technical debt. I save time by using it, but if I had all the time in the world, I would probably opt for a different solution, and I may be forced to pay off the debt at certain times.


Digging into ActiveRecord and PostgreSQL Enums

Posted on .

I came across an interesting problem in one of my ActiveRecord models (paraphrased, this isn’t the exact model):

class Event < ActiveRecord::Base
  attr_accessible :certainty
  validates :certainty, :inclusion => {
    :in => %w(less neutral more),
    :message => "%{value}"
  }
end

The problem was that I would set certainty to one of the accepted values, let’s say 'less', and the form would wind up throwing a validation error. I overrode the default error message just to retrieve the value, and it turned out the value for certainty was 0.

The reason this was happening is because certainty is defined as a PostgreSQL enumerated type:

CREATE TYPE certainty AS ENUM ('less', 'neutral', 'more');

and these are the type detection methods in ActiveRecord:

From lib/active_record/connection_adapters/postgresql_adapter.rb:

def simplified_type(field_type)
  case field_type
  # Numeric and monetary types
  when /^(?:real|double precision)$/
  # Monetary types
  when 'money'
  when 'hstore'
  # Network address types
  when 'inet'
  when 'cidr'
  when 'macaddr'
  # Character types
  when /^(?:character varying|bpchar)(?:\(\d+\))?$/
  # Binary data types
  when 'bytea'
  # Date/time types
  when /^timestamp with(?:out)? time zone$/
  when 'interval'
  # Geometric types
  when /^(?:point|line|lseg|box|"?path"?|polygon|circle)$/
  # Bit strings
  when /^bit(?: varying)?(?:\(\d+\))?$/
  # XML type
  when 'xml'
  # tsvector type
  when 'tsvector'
  # Arrays
  when /^\D+\[\]$/
  # Object identifier types
  when 'oid'
  # UUID type
  when 'uuid'
  # Small and big integer types
  when /^(?:small|big)int$/
  # Pass through all types that are not specific to PostgreSQL.
  else
    super
  end
end

and lib/active_record/connection_adapters/column.rb:

def simplified_type(field_type)
  case field_type
  when /int/i
  when /float|double/i
  when /decimal|numeric|number/i
    extract_scale(field_type) == 0 ? :integer : :decimal
  when /datetime/i
  when /timestamp/i
  when /time/i
  when /date/i
  when /clob/i, /text/i
  when /blob/i, /binary/i
  when /char/i, /string/i
  when /boolean/i
  end
end

The field_type in the above method calls is 'certainty'. The simplified_type in PostgreSQLColumn doesn’t match any of the cases, so the call gets passed to the parent. In the parent, Column, 'certainty' matches against when /int/i and returns :integer as the type. Once :integer is set as the type, ActiveRecord does its thing and converts the attribute to an integer prior to a save. The validation then triggers on the changed value.
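You can reproduce the surprising match directly. Ruby’s when /int/i behaves like a case-insensitive substring search, which this quick Python check mimics:

```python
import re

# 'certainty' contains the substring 'int' (certa-int-y), so the
# case-insensitive pattern used for generic columns matches and the
# column gets typed as :integer
assert re.search(r'int', 'certainty', re.I) is not None

# 'certain' has no 'int' substring, which is why renaming the enum
# type makes the spurious match go away
assert re.search(r'int', 'certain', re.I) is None
```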

I couldn’t figure out where the variable actually gets type checked and converted. I tried modifying the column type directly in the model, but that didn’t prevent the variable from being converted. My solution was to simply ALTER TYPE certainty RENAME TO certain. Postgres cascaded the changes and everything ended up being okay.

I want to thank Joe for taking the lead and filing a bug report here:


Placemixer: UI Experiments in Maps

Posted on .

Back in 2008, I worked on a concept called Placemixer, an itinerary planning web application.

We wanted to create a map application with an emphasis on relevant points of interest such as airports, hotels, and restaurants. We believed that being able to quickly locate these points of interest would be really helpful when traveling. Users could then easily add these points to their itinerary.

Screenshot of main map interface

The itinerary has two types of entries: regular and time-sensitive. Time-sensitive entries occur at a specific time, such as catching a show on Broadway or making a restaurant reservation. Since these entries don't have as much flexibility in time, regular entries would be grouped by distance to time-sensitive ones. The idea is to create options through flexibility in the itinerary.

Screenshot of the itinerary sidebar

Gravitate is an extension of the above idea. Clicking Gravitate on a place would zoom the map to show all the locations closest to the selection. For the prototype, we applied the feature only to hotels, because it made a lot of sense for people to plan their hotel options based on proximity to the things they want to do.

Once the user was done, a printable version of the itinerary was offered, featuring directions, the itinerary listing, and points near each item.

Preview of Generated Itinerary

Looking at it now, I would make a few changes to the interface. I would move an itinerary timeline to the top for pagination and create another screen with small versions of the maps corresponding to each itinerary.

Mockup for timeline view

The entire concept was targeted towards traveling to places foreign to the user. The user may travel to a place and not realize the location of the closest airports, or where certain landmarks are relative to each other. For these users, a printed itinerary would also come in handy in case Internet access was unavailable.

I believe that technology will bring about change in how we plan trips. TripIt is a perfect example of a great application simplifying complex travel arrangements. Augmented reality and mobile Internet access have become more widespread in the past few years. Placemixer, in its last incarnation, was targeted towards power users who want an interface to plan out trips. I don’t know if there’s a market for such an application. I instead believe editorially created travel guides, or quick auto-planning guides combined with location-aware applications, are the future of travel. This satisfies two core ideas: expedient planning and adaptability.


Getting Started with Map Tiling: Mapnik and Shapefiles

Posted on .

About a year ago, I was working on launching a website with a map interface. I wanted full control over everything, including generating my own map tiles instead of using a third-party provider like Google Maps. It ended up being a bad idea, because map tiles take a lot of resources to generate and store. But by going through the process, I gained a much better understanding of map tiles, and I hope this will serve as a step-by-step guide showing how to generate tiles from start to finish.

My environment is Ubuntu 8.10. To start with, I installed the following packages from the Terminal:

sudo apt-get install postgresql-8.3-postgis
sudo apt-get install python-mapnik
sudo apt-get install libmapnik-dev
sudo apt-get install imagemagick

The next thing I needed was data and I was able to obtain and play around with a San Francisco sample shapefile from Navteq.

A few important things should be noted about shapefiles. A shapefile is actually multiple files (.sbn, .shx, .shp, .dbf) connected by their filename. A shapefile is like the name describes: a file containing data for a bunch of shapes. For example, it may contain data that says draw a line from one geospatial coordinate (latitude, longitude) to another. Or it may contain data that says draw a polygon with a vertex at different coordinates. It can also contain a combination of all of these.

Each shapefile defines its own map, and for complicated maps, we typically need to overlay multiple shapefiles. One shapefile may define the city boundaries while another shapefile has all the streets. If we wanted both city boundaries and streets, we would overlay one shapefile on top of the other.

Let's look at using Mapnik to interpret these shapefiles. The hard work will be based on an existing tile-generation script, which I'll set up by running the following in the Terminal from my home directory:

mkdir mapnik
cd mapnik
svn export ./

If you're not using the same directory structure as me (home_directory/mapnik), there will be some differences later on, but things should still work.

The script will not be immediately ready to work with my dataset. First, there are pre-populated test cases starting on line 123:

# World
bbox = (-180.0, -90.0, 180.0, 90.0)
render_tiles(bbox, mapfile, tile_dir, 0, 5, "World")

minZoom = 10
maxZoom = 16
bbox = (-2, 50.0, 1.0, 52.0)
render_tiles(bbox, mapfile, tile_dir, minZoom, maxZoom)

And so on… The first thing I did was wipe out all the code from line 123 on and replace it with the following:

bbox = (-122.4, 37.76, -122.4, 37.8)
render_tiles(bbox, mapfile, tile_dir, 15, 16, "SF")

Keep in mind the sample data I requested from Navteq was the SF region. If the data you have only corresponds to a certain region, then entering a bounding box (bbox) outside of the coverage will result in a bunch of empty images. You can check the shapefile with software like QGIS to make sure the data matches what you think.

I now need to create an XML mapfile. The mapfile defines the data source and how to visualize the data, and it will be used by the script to generate the map. I created mapfile.xml and put in the following:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE Map>
<Map bgcolor="#f2eff9" srs="+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +no_defs +over">
  <Style name="StreetStyle">
    <Rule>
      <LineSymbolizer>
        <CssParameter name="stroke">#000000</CssParameter>
        <CssParameter name="stroke-width">0.1</CssParameter>
      </LineSymbolizer>
    </Rule>
    <Rule>
      <TextSymbolizer name="ST_NAME" face_name="DejaVu Sans Book" size="9" fill="black" halo_fill="#DFDBE3" halo_radius="1" wrap_width="20" spacing="5" allow_overlap="false" avoid_edges="false" min_distance="10" placement="line" />
    </Rule>
  </Style>

  <Layer name="Streets" srs="+proj=latlong +datum=WGS84">
    <StyleName>StreetStyle</StyleName>
    <Datasource>
      <Parameter name="type">shape</Parameter>
      <Parameter name="file">Streets</Parameter>
    </Datasource>
  </Layer>
</Map>

Finally, I replace lines 109 - 116 of the script:

try:
    mapfile = os.environ['MAPNIK_MAP_FILE']
except KeyError:
    mapfile = home + "/"
try:
    tile_dir = os.environ['MAPNIK_TILE_DIR']
except KeyError:
    tile_dir = home + "/osm/tiles/"

with my own mapfile and tile directory, which were set up before:

mapfile = home + "/mapnik/mapfile.xml"
tile_dir = home + "/mapnik/tiles/"

Run the script, and the map tiles should show up in the tiles subdirectory.

Other than the bounding box, which I discussed before, the only other thing to change is the XML mapfile. By editing it, we should be able to generate a mapfile for any data source, even one different from the Navteq shapefiles.

I would leave the first lines the same and start from the first interesting bit:

<Style name="StreetStyle">
  <Rule>
    <LineSymbolizer>
      <CssParameter name="stroke">#000000</CssParameter>
      <CssParameter name="stroke-width">0.1</CssParameter>
    </LineSymbolizer>
  </Rule>
  <Rule>
    <TextSymbolizer name="ST_NAME" face_name="DejaVu Sans Book" size="9" fill="black" halo_fill="#DFDBE3" halo_radius="1" wrap_width="20" spacing="5" allow_overlap="false" avoid_edges="false" min_distance="10" placement="line" />
  </Rule>
</Style>

A style is applied to a particular shapefile, or other data source. Each style can be composed of different rules. Let's start with the first rule in the above snippet: LineSymbolizer. Remember when I described shapefiles? They can contain lines, polygons, and so on. The Streets shapefile is primarily made up of lines. We must then create a style for these lines in order to see them in the final tiles. Shapes without a style applied will not be seen.

The next rule is the TextSymbolizer. Text for the shapefiles can be found in the .dbf file, which can be viewed with a DBF viewer. The file is a simple spreadsheet with text mapped to the shapes. The name attribute with the value ST_NAME is the link from the TextSymbolizer to the .dbf file. If we opened Streets.dbf, we would see a column called ST_NAME with each row corresponding to a street name. If you aren't using the same shapefile as I am, then you should consult the .dbf file to see what text can be placed on your map.

I glossed over the specific styling rules, because I found them to be quite straightforward. For a complete list and explanation, bookmark Cascadenik.

The final part of my xml file is where we define the data source and set the style for that source:

<Layer name="Streets" srs="+proj=latlong +datum=WGS84">
  <StyleName>StreetStyle</StyleName>
  <Datasource>
    <Parameter name="type">shape</Parameter>
    <Parameter name="file">Streets</Parameter>
  </Datasource>
</Layer>

I think the above is straightforward. I defined a layer, told it I'll be styling it with the style created above, and defined the data source as a shapefile named Streets. The only confusing bit I haven't gone over yet is the srs attribute in the layer. Most shapefiles I've encountered use the WGS84 datum, but if the tiles aren't coming out, this may be the cause. To verify, open the shapefile with a program like QGIS and check the default projection being used.

I'm not familiar enough with the different projections to be able to comfortably write about them in any detail. If the bounding box is correct, and the xml is correct, then the next most likely error is the projection settings being used. You may have noticed this line as well:

+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0 +k=1.0 +units=m +nadgrids=@null +no_defs +over

The above line is placed in the code to specify the projection of the tiles. In this case, I'm copying the very common method (at least for Internet maps) of using a Mercator projection onto a square, distorting the Earth's major and minor axes into a sphere (the Earth is actually an imperfect ellipsoid) with a radius of 6,378,137 meters. Projecting onto a square simply makes things easier, and the distortions are not significant enough to adversely affect things like driving directions, points of interest, and so on. Since the source data can be in a different coordinate system, such as WGS84, we must specify both the source projection and the final projection, or Mapnik won't know how to render the tiles.
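For intuition, the spherical Mercator math that projection string encodes can be sketched in a few lines (a simplified illustration using the sphere radius from the srs string, not Mapnik's actual code path):

```python
import math

R = 6378137.0  # sphere radius from the srs string, in meters

def to_mercator(lon_deg, lat_deg):
    """Project WGS84 degrees onto spherical Mercator (x, y) in meters."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# roughly a corner of the San Francisco bounding box used earlier
x, y = to_mercator(-122.4, 37.76)
```

The stretching in y as latitude grows is exactly the distortion that makes Greenland look enormous on web maps.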

I know my description of projections may not make too much sense, but I'm trying to cover a lot of ground as quickly as possible. It's actually quite an important topic deserving of its own article written by someone whose expertise is greater than mine. The main take-away is that the map being generated is not a perfect representation of the Earth, but it works for most of the typical purposes used in a service like Google Maps. If you need a more accurate picture of the Earth, it would be wise to consult the different projections.

For more reading on projections, I would recommend the following from Charlie Savage's blog:

I hope it's easy to see that from these simple beginnings, we can build more complicated maps. We can add additional layers and styles on top of what we've defined above. Each layer can correspond to a different shapefile. We might have a shapefile for points of interest, or one with census data. We can then add more details and information to our maps.

I hope it's also easy to see that creating a custom set of map tiles is no easy task. This is the main reason I abandoned the idea of making a custom set of tiles for my application. If you don't need a custom tileset, then don't go down this road. Beyond the sheer processing power to generate all the tiles needed, you also have to worry about the bandwidth needed to transmit these tiles. There's a reason why sites are using Google Maps and not creating their own tiles. It's a hassle to do the latter. But there are situations where Google Maps falls short. If you need more accurate maps for certain situations, different kinds of detailing on the tile, or are trying to compete directly as a map tile server, then you'll definitely need to look into creating your own tiles.