Some of you might remember DebConf12 in Managua, Nicaragua, and the very friendly and helpful locals. Some of them recently contacted me about their new project and asked me to share it on Planet Debian: a local community of OpenStreetMap enthusiasts, some of whom were involved in DebConf12, has for the first time collected detailed information about Managua's bus network!
To take their efforts further, they will now print these maps on paper so that even more people can use them in their daily lives.
If you haven't been to Managua, you might not immediately appreciate the usefulness of this. Up until now, there has been neither a map nor a timetable for the bus system, which, as you can now easily see even from far away, is actually quite big and is used by 80% of the population in a city where the streets still have no names.
If this made you curious (or just brought back happy memories from 2012), please go to http://support.mapanica.net and donate some money. Their campaign runs for 3 more weeks; they have already raised 3300 USD, enough to print some maps, but are still 4200 USD short of their goal. Every further donation will help to print more maps; even something as little as 20 USD or EUR will help people in their daily lives to better understand the beast that is Managua's bus route network.
Recently I decided to review my NLP studies, and I believe the best way to learn or relearn a subject is to teach it. This is one in a series of 4 posts with a walk-through of the algorithms we implemented during the course. I'll provide links to my code hosted on GitHub.
Disclaimer: Before taking this NLP course, the only thing I knew about Python was that ‘it's the one without curly brackets’. I learned Python on the go while implementing these algorithms. So if I did anything against Python code conventions or flat-out heinous, I apologize and thank you in advance for your understanding. Feel free to write and let me know.

The Concept
To quote Wikipedia, “Named-entity recognition (I’ve always known it as tagging) is a subtask of information extraction that seeks to locate and classify elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.”
For example, the algorithm receives as input some text:
“Bill Gates founded Microsoft in 1975.”
and produces as output the same text with the entities tagged:
“Bill Gates[person] founded Microsoft[organization] in 1975[date].”
Off the top of my head, some useful applications are document matching (e.g. a document containing Gates[person] may not be on the same topic as one containing gates[object]) and query searches. I’m sure there are lots more; if you check out Collins’s Coursera course, he may discuss this in greater depth.

The Requirements
Development data: The file ner_dev.dat provided by prof. Michael Collins has a series of sentences separated by an empty line, one word per line.
Training data: The file ner_train.dat provided by prof. Michael Collins has a series of sentences separated by an empty line, one word and tag per line, separated by a space.
Word-tag count data: The file ner.counts has the format [count] [type of tag] [label] [word]. The tags used are RARE, O, I-MISC, I-PER, I-ORG, I-LOC, B-MISC, B-PER, B-ORG, B-LOC. The tag O means the word is not a named entity. This file is generated by count_freqs.py, a script provided by prof. Michael Collins; run count_freqs.py on the training data ner_train.dat.

The Algorithm
Python code: viterbi.py
Usage: python viterbi.py ner.counts ngram.counts [input_file] > [output_file]

Summary: The Viterbi algorithm finds the maximum-probability path for a series of observations, based on emission and transition probabilities. In a Markov process, emission is the probability of an output given a state, and transition is the probability of moving to a state given the previous states.

In our case, the emission parameter e(x|y) is the probability of the word being x given that you attributed tag y. If your training data had 100 counts of ‘person’ tags, one of which is the word ‘London’ (I know a guy who named his kid London), then e(‘London’|’person’) = 0.01. Now with 50 counts of ‘location’ tags, 5 of which are ‘London’, e(‘London’|’location’) = 0.1, which clearly trumps 0.01. The transition parameter q(yi | yi-1, yi-2) is the probability of putting tag y in position i given its two previous tags. It is calculated as Count(trigram)/Count(bigram).

For each word in the development data, the Viterbi algorithm associates a score with each word-tag combination based on the emission and transition parameters it obtained from the training data. It does this for every possible tag and picks the most likely one. Clearly this won’t be 100% correct, as natural language is unpredictable, but you should get pretty high accuracy.
Re-label words in training data with frequency < 5 as ‘RARE’ - This isn’t required, but useful. If you apply it, re-run count_freqs.py on the relabeled training data.
Python code: label_rare.py
Usage: python label_rare.py [input_file]
- Uses Python’s Counter to obtain word counts in [input_file]; keeps the word-count pairs with count < 5, storing them in a dictionary named rare_words.
- Iterates through each line in [input_file], checks if the word is in the rare_words dictionary and, if so, replaces the word with RARE.
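The two passes above can be sketched as follows. This is a minimal sketch operating on the file's lines in memory; the exact RARE token spelling and the behavior of the real label_rare.py are assumptions.

```python
from collections import Counter

def label_rare(lines, threshold=5, rare_token="RARE"):
    """Replace words seen fewer than `threshold` times with rare_token.

    `lines` holds the training data: "word tag" per line, with sentences
    separated by empty lines.
    """
    # First pass: count how often each word occurs.
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:  # skip the blank lines separating sentences
            counts[parts[0]] += 1
    rare_words = {w for w, c in counts.items() if c < threshold}

    # Second pass: rewrite each line, substituting rare words.
    out = []
    for line in lines:
        parts = line.split()
        if parts and parts[0] in rare_words:
            parts[0] = rare_token
        out.append(" ".join(parts))
    return out
```
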
Step 1. Get Count(y) and Count(x~y)
Python code: emission_counts.py
- Iterate through each line in the ner.counts file:
1.1 Store each word-label-count combo in a dictionary count_xy containing a dictionary for each word encountered. Each word dictionary contains key-value pairs of the label given to the word and its respective count, i.e. count_xy['Peter']['I-PER'] returns the number of times the word ‘Peter’ was labeled ‘I-PER’ in the training data.
1.2 The dictionary count_y contains one item per label (RARE, O, I-MISC, I-PER, I-ORG, I-LOC, B-MISC, B-PER, B-ORG, B-LOC); at each line, add the count to its respective label in count_y to obtain the absolute tag frequency, Count(y).
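A minimal sketch of this step, assuming a ner.counts column layout of "count WORDTAG tag word" (the exact marker name in the second column is an assumption about count_freqs.py's output):

```python
from collections import defaultdict

def emission_counts(counts_lines):
    """Build Count(x~y) and Count(y) from the lines of ner.counts."""
    count_xy = defaultdict(dict)  # count_xy[word][tag] -> Count(x~y)
    count_y = defaultdict(int)    # count_y[tag]        -> Count(y)
    for line in counts_lines:
        parts = line.split()
        if len(parts) == 4 and parts[1] == "WORDTAG":
            count, _, tag, word = parts
            count_xy[word][tag] = int(count)
            count_y[tag] += int(count)
    return count_xy, count_y

def emission(word, tag, count_xy, count_y):
    # e(x|y) = Count(x~y) / Count(y)
    return count_xy[word].get(tag, 0) / float(count_y[tag])
```

For instance, given the lines "5 WORDTAG I-PER Peter" and "2 WORDTAG I-PER John", e('Peter'|'I-PER') comes out as 5/7.
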
Step 2. Get bigram and trigram counts
Python code: transition_counts.py
- Iterate through each line in the n-gram counts file:
2.1 If the line contains ‘2-GRAM’, add an item to the bigram_counts dictionary using the bigram (the two space-separated labels following the tag type ‘2-GRAM’) as key and the count as value. This dictionary will contain Count(yi-2, yi-1).
2.2 If the line contains ‘3-GRAM’, add an item to the trigram_counts dictionary using the trigram as key and the count as value. This dictionary will contain Count(yi-2, yi-1, yi).
- Return dictionaries of bigram and trigram counts.
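A sketch of the same idea in code; the column layout of the n-gram counts file ("count 2-GRAM tag1 tag2" / "count 3-GRAM tag1 tag2 tag3") is assumed from count_freqs.py's output:

```python
def transition_counts(ngram_lines):
    """Build bigram and trigram tag counts from the n-gram counts file."""
    bigram_counts, trigram_counts = {}, {}
    for line in ngram_lines:
        parts = line.split()
        if len(parts) == 4 and parts[1] == "2-GRAM":
            bigram_counts[tuple(parts[2:4])] = int(parts[0])   # Count(yi-2, yi-1)
        elif len(parts) == 5 and parts[1] == "3-GRAM":
            trigram_counts[tuple(parts[2:5])] = int(parts[0])  # Count(yi-2, yi-1, yi)
    return bigram_counts, trigram_counts

def q(tag, prev2, prev1, bigram_counts, trigram_counts):
    # q(yi | yi-2, yi-1) = Count(trigram) / Count(bigram)
    return trigram_counts.get((prev2, prev1, tag), 0) / float(bigram_counts[(prev2, prev1)])
```
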
Step 3. Viterbi
(For each line in the [input_file]):
- If the word was seen in the training data (present in the count_xy dictionary), for each of the possible labels for the word:
1.1 Calculate emission = count_xy[word][label] / float(count_y[label])
1.2 Calculate transition = trigram_counts[trigram] / float(bigram_counts[bigram]). Note: yi-2 = *, yi-1 = * for the first round
1.3 Set probability = emission × transition
1.4 Update max(probability) and arg max if needed.
- If the word was not seen in the training data:
2.1 Calculate emission = count_xy[RARE][label] / float(count_y[label])
2.2 Calculate q(yi|yi-1, yi-2) = trigram_counts[trigram] / float(bigram_counts[bigram]). Note: yi-2 = *, yi-1 = * for the first round
2.3 Set probability = emission × q(yi|yi-1, yi-2)
2.4 Update max(probability) and arg max if needed, using the counts of RARE.
- Write arg max and log(max(probability)) to output file.
- Update yi-2, yi-1.
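Putting the pieces together, the per-word loop described above can be sketched like this. Note this is a greedy simplification that only keeps the two previously chosen tags (full Viterbi keeps scores for all tag paths); the helper dictionary names are carried over from the earlier sketches and are assumptions, not the real viterbi.py internals.

```python
import math

def tag_sentence(words, count_xy, count_y, bigram_counts, trigram_counts):
    """Greedily tag each word with the label maximizing emission * transition.

    Returns a list of (word, best_tag, log_probability) tuples.
    """
    result = []
    prev2, prev1 = "*", "*"  # sentence-start padding tags
    for word in words:
        # Unseen words fall back to the RARE pseudo-word counts.
        key = word if word in count_xy else "RARE"
        best_tag, best_score = None, 0.0
        for tag, c in count_xy[key].items():
            emission = c / float(count_y[tag])
            transition = (trigram_counts.get((prev2, prev1, tag), 0)
                          / float(bigram_counts[(prev2, prev1)]))
            score = emission * transition
            if score > best_score:
                best_tag, best_score = tag, score
        result.append((word, best_tag, math.log(best_score)))
        # Shift the tag history for the next position.
        prev2, prev1 = prev1, best_tag
    return result
```
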
Prof. Michael Collins provided an evaluation script eval_ne_tagger.py to verify the output of your Viterbi implementation. Usage: python eval_ne_tagger.py ner_dev.key [output_file]
Word is that Drupal is getting a frontend framework. From the multiple options, it seems EmberJS is currently a little ahead of Angular and React. As said in Dries' original post as well as in the Drupal core issue created, nothing is final, and everyone interested is asked to check that the set of libraries in the comparison is sufficient and, more importantly, that the criteria used for evaluation are relevant.
Discussing those details is not what this post is about; like others, I've been questioning the move from Dries. Since many of us are professionals, let's put this in a professional setting and pretend that Dries is just another client making a feature request seemingly out of the blue. To him the problem and the solution are clear, obvious even, and it is the only way to achieve his vision. Let's check.

Client side (pun intended)
Drupal's user interfaces for content modeling (Field UI), layout management (Panels), and block management would benefit from no page refreshes, instant previews, and interface previews. These traits could also enrich the experience of site visitors; Drupal's commenting functionality could similarly gain from these characteristics and resemble an experience more like the commenting experience found in Facebook.
As the Drupal community, we need to stop thinking of Drupal as a "content management platform" and start looking at it as a "digital experience platform" used to create ideal visitor experiences.
Ideal as in enriched, as in, for example, Acquia Lift. Don't get your pitchforks just now, there is no hidden agenda, just finish reading.

How serious is the client
Sometimes features can be swept under the rug and everyone will feel better in the long term. Sometimes the client does not let it go. So how serious is Dries about this? The two posts directly related to frameworks contain 3,387 words, and if you include the related posts you can add 10,394 more words. A busy person doesn't write a short story just for fun. So I'd say he is pretty serious about this, and if you read the trail of posts this is not going away.

Client needs
We know a few things about what the client is trying to address:
- He expects the web to be completely different in 10 years.
- Most sites will need personalization.
- Better UX is crucial.
- One solution fitting core and contrib.
Since there needs to be one solution, it has to be in core from the start, because contrib is not disciplined enough (by design) to come up with one homogeneous solution in less than 10 years.

A little extrapolation
If you keep in mind all the posts Dries has been writing on the topic for the past two years, it makes sense that web components or WebSockets do not address the issue of rich interfaces the way a frontend framework would; in this discussion, any PHP-based solution is also off-topic. It looks to me like Dries is trying to get the community, as well as Drupal itself, ready for what he believes is coming. I deeply disagree on what the future holds for the web, but that doesn't mean nothing should be done, just in case. At worst we'll burn the new people who came to help us switch to the new framework.

Solution
All in all, I would agree that under those assumptions a framework is a valid tool to be using. Putting my JS maintainer hat on, I would suggest to jQueryUI it: put it in core, use it anecdotally, and hope contrib will pick it up. We should also choose the framework with the strongest opinion on how to do things, because Drupal's back end is rather strongly opinionated about what the PHP should look like; it makes sense for the JS to be the same.

On Acquia bashing
I've spent more than 2 years as an Acquia consultant, working closely with the Office of the CTO on several big D8 improvements, so I've seen how the community is treated from the inside, and I've only seen good will towards it. Sometimes things are downplayed, not out of malice, but out of concern for the issue at hand. Which is why I think Dries didn't explicitly mention Acquia Lift (but still hinted at it): to avoid getting dragged into an argument about Acquia's influence. There is nothing wrong with that; compared to the fears expressed during the D8 cycle, we're far from the worst possible scenario.
On that topic, when people say that Acquia, big companies or startups are influencing Drupal I think they're taking a shortcut. It's more like Acquia clients are influencing Dries, and in turn he steers Drupal to what he thinks is right. But don't forget that between clients and Drupal there is a filter, it's Dries. So far I think we can agree he's been pretty great at Dictatoring Drupal. So let's at least give him the benefit of the doubt.
Put your pitchforks back and grab some paint, there is a bikeshed to paint.
List all posts by Authors, nested Categories and Titles is a WordPress plugin I wrote to fix a menu issue I had during a complex website development; it has been included in the official WordPress Plugin repository. The plugin is particularly suitable for multi-nested-category, multi-author websites handling a large number of posts and a complex nested category layout (e.g. academic papers, newspaper articles, etc.). It allows the user to place a shortcode into any page and get rid of a long, nested menu/submenu to show all of the site's posts. A selector in the page allows the reader to choose grouping by Category/Author/Title. You can also install a “tab” plugin (e.g. Tabby Responsive Tabs) and arrange each group on its own tab.
Output grouped by Category will look like:

CAT1
    post1 AUTHOR
    SUBCAT1
        post2 AUTHOR
        post3 AUTHOR
    SUBCAT2
        post4 AUTHOR
...
while in the “Author” grouping mode, it is:

AUTHOR1
    post1 [CATEGORY]
    post2 [CATEGORY]
AUTHOR2
    post1 [CATEGORY]
    post2 [CATEGORY]
...
The plugin installs a new menu “ACT List Shortcodes” in Admin->Tools. The tool is a helper to automatically generate the required shortcode. It will parse the options and display the string to be copied and pasted into any page.
The plugin is released under the GPL2 license and can be downloaded from its page on WP Plugins.