Unicode Consortium

The Unicode Consortium enables people around the world to use computers in any language. Our freely-available specifications and data form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
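A minimal Python sketch of the conflict described above. The byte value (0xE9) and the three legacy code pages are chosen purely for illustration; they are not taken from the original text.

```python
# One byte value, three legacy encodings, three different characters.
raw = bytes([0xE9])
decoded = {codec: raw.decode(codec) for codec in ("latin-1", "cp1251", "cp437")}
for codec, char in decoded.items():
    print(f"0xE9 read as {codec}: {char}")

# The reverse problem: one character, a different number in each encoding.
encoded = {codec: "é".encode(codec) for codec in ("latin-1", "cp437")}
for codec, data in encoded.items():
    print(f"'é' written as {codec}: 0x{data.hex().upper()}")
```

Reading a file with the wrong one of these code pages raises no error; it silently produces the wrong letters, which is the corruption risk the paragraph refers to.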

Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
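As a rough illustration, Python exposes that unique number (the code point) through the built-in `ord`. The sample characters and the helper name `code_point` below are assumptions made for this sketch.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Format a character's Unicode code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in ("A", "é", "й", "字"):
    # The code point is fixed; only the byte serialization (here UTF-8) varies.
    print(code_point(ch), unicodedata.name(ch), ch.encode("utf-8").hex())

# Text survives a round trip through any Unicode encoding form unchanged.
text = "Aéй字"
round_trips = [text.encode(enc).decode(enc) == text for enc in ("utf-8", "utf-16", "utf-32")]
print(all(round_trips))
```

The number stays the same on every platform; UTF-8, UTF-16 and UTF-32 are just different ways of writing that number as bytes.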

The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystems, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.

About the Unicode Consortium

The Unicode Consortium was founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards. The Consortium is a non-profit, 501(c)(3) charitable organization. The membership of the Consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry. The Consortium is supported financially through membership dues and donations. Membership in the Unicode Consortium is open to organizations and individuals anywhere in the world who support the Unicode Standard and wish to assist in its extension and implementation. All are invited to contribute to the support of the Consortium’s important work by making a donation.

For more information, see the Glossary, Technical Introduction and Useful Resources.

Investigating the algorithms that govern our lives – Columbia Journalism Review

Just an old-school style investigative look into technology, data, algorithms and humanity.

As online users, we’ve become accustomed to the giant, invisible hands of Google, Facebook, and Amazon feeding our screens. We’re surrounded by proprietary code like Twitter Trends, Google’s autocomplete, Netflix recommendations, and OKCupid matches. It’s how the internet churns. So when Instagram or Twitter, or the Silicon Valley titan of the moment, chooses to mess with what we consider our personal lives, we’re reminded where the power actually lies. And it rankles.

While internet users may be resigned to these algorithmic overlords, journalists can’t be. Algorithms have everything journalists are hardwired to question: They’re powerful, they’re secret, and they govern essential parts of society. Algorithms decide how fast Uber gets to you, whether you’re approved for a loan, whether a prisoner gets parole, who the police should monitor, and who the TSA should frisk.

Algorithms are built to approximate the world in a way that accommodates the purposes of their architect, and “embed a series of assumptions about how the world works and how the world should work,” says Hansen.

It’s up to journalists to investigate those assumptions, and their consequences, especially where they intersect with policy. The first step is extending classic journalism skills into a nascent domain: questioning systems of power, and employing experts to unpack what we don’t know. But when it comes to algorithms that can compute what the human mind can’t, that won’t be enough. Journalists who want to report on algorithms must expand their literacy into the areas of computing and data, in order to be equipped to deal with the ever-more-complex algorithms governing our lives.

The reporting so far

Few newsrooms consider algorithms a beat of their own, but some have already begun this type of reporting.

Algorithms can generally be broken down into three parts: the data that goes in; the “black box,” or the actual algorithmic process; and the outcome, or the value that gets spit out, be it a prediction or score or price. Reporting on algorithms can be done at any of the three stages, by analyzing the data that goes in, evaluating the data that comes out, or reviewing the architecture of the algorithm itself to see how it reaches its judgements.

Currently, the majority of reporting on algorithms is done by looking at the outcomes and attempting to reverse-engineer the algorithm, applying techniques similar to those used in data journalism. The Wall Street Journal used this technique to find that Staples’ online prices were determined by the customer’s distance from a competitor’s store, leaving prices higher in rural areas. And FiveThirtyEight used the method to skewer Fandango’s movie ratings (which skewed abnormally high, rarely dipping below 3 stars), while a ProPublica analysis suggested that Uber’s surge pricing increases cost but not the supply of drivers.

 

….

Can an algorithm be racist?

“Algorithms are like a very small child,” says Suresh Venkatasubramanian. “They learn from their environment.”

Venkatasubramanian is a computer science professor at the University of Utah. He has been thinking about algorithmic fairness ever since he read “Human Readable,” a short story by Cory Doctorow published in 2006. The story takes place in a future world, similar to ours, but in which all national infrastructure (traffic, email, the media, etc.) is run by “centralized emergent networks,” modeled after ant colonies. Or in other words: a network of algorithms. The plot revolves around two lovers: a network engineer who is certain the system is incorruptible, and a lawyer who knows it’s already been corrupted.

“It got me thinking,” says Venkatasubramanian. “What happens if we live in a world that is totally driven by algorithms?”

He’s not the only one asking that question. Algorithmic accountability is a growing discipline across a number of fields. Computer scientists, legal scholars, and policy wonks are all grappling with ways to identify or prevent bias in algorithms, along with the best ways to establish standards for accountability in business and government. A big part of the concern is whether (and how) algorithms reinforce or amplify bias against minority groups.

Algorithmic accountability builds on the existing body of law and policy aimed at combatting discrimination in housing, employment, admissions, and the like, and applies the notion of disparate impact, which looks at the impact of a policy on protected classes rather than its intention. What that means for algorithms is that they don’t have to be intentionally racist to have racist consequences.
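One common way to put a number on disparate impact is to compare how often each group receives the favorable outcome. The sketch below does that on made-up data; the group labels, the records and the function names are all hypothetical. The 0.8 threshold is the “four-fifths” rule of thumb from US employment guidelines, used here as an assumption rather than something the article specifies.

```python
from collections import Counter


def selection_rates(records):
    """Map each group to the share of its members with a favorable outcome.

    `records` is an iterable of (group, selected) pairs.
    """
    totals, positives = Counter(), Counter()
    for group, selected in records:
        totals[group] += 1
        positives[group] += int(selected)
    return {group: positives[group] / totals[group] for group in totals}


def disparate_impact_ratio(records, protected, reference):
    """Ratio of the protected group's selection rate to the reference group's."""
    rates = selection_rates(records)
    return rates[protected] / rates[reference]


# Hypothetical outcomes: group "a" is selected 60% of the time, group "b" 30%.
records = (
    [("a", True)] * 60 + [("a", False)] * 40
    + [("b", True)] * 30 + [("b", False)] * 70
)

ratio = disparate_impact_ratio(records, protected="b", reference="a")
print(f"ratio = {ratio:.2f}, below four-fifths threshold: {ratio < 0.8}")
```

Nothing in the calculation looks at why the rates differ. That is the point of the disparate-impact framing: it measures consequences, not intent.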

Algorithms can be especially susceptible to perpetuating bias for two reasons. First, algorithms can encode human bias, whether intentionally or otherwise. This happens by using historical data or classifiers that reflect bias (such as labeling gay households separately, etc.). This is especially true for machine-learning algorithms that learn from users’ input. For example, researchers at Carnegie Mellon University found that women were receiving ads for lower-paying jobs on Google’s ad network but weren’t sure why. It was possible, they wrote, that if more women tended to click on lower-paying ads, the algorithm would learn from that behavior, continuing the pattern.

Second, algorithms have some inherently unfair design tics—many of which are laid out in a Medium post, “How big data is unfair.” The author points out that since algorithms look for patterns, and minorities by definition don’t fit the same patterns as the majority, the results will be different for members of the minority group. And if the overall success rate of the algorithm is pretty high, it might not be noticeable that the people it isn’t working for all belong to a similar group.
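A small sketch of how a high overall success rate can hide a group the algorithm fails. All the numbers, the group names and the always-predicts-1 classifier are invented for illustration.

```python
def accuracy(pairs):
    """Share of (predicted, actual) pairs where the prediction was right."""
    pairs = list(pairs)
    return sum(predicted == actual for predicted, actual in pairs) / len(pairs)


# Hypothetical results as (group, predicted, actual) for a classifier that
# has learned the majority pattern and predicts 1 for everyone.
results = (
    [("majority", 1, 1)] * 930 + [("majority", 1, 0)] * 20
    + [("minority", 1, 1)] * 20 + [("minority", 1, 0)] * 30
)

overall = accuracy((p, a) for _, p, a in results)
by_group = {
    group: accuracy((p, a) for g, p, a in results if g == group)
    for group in ("majority", "minority")
}

print(f"overall:  {overall:.1%}")
for group, value in by_group.items():
    print(f"{group}: {value:.1%}")
```

With these invented numbers the headline accuracy is 95%, yet the minority group sees 40%. Breaking a metric out by group is the simplest check for the effect the post describes.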

To rectify this, Venkatasubramanian, along with several colleagues, wrote a paper on how computer scientists can test for bias mathematically while designing algorithms, the same way they’d check for accuracy or error rates in other data projects. He’s also building a tool for non-computer scientists, based on the same statistical principles, which scores uploaded data with a “fairness measure.” Although the tool can’t check if an algorithm itself is fair, it can at least make sure the data you’re feeding it is. Most algorithms learn from input data, Venkatasubramanian explains, so that’s the first place to check for bias.

Much of the reporting on algorithms thus far has focused on their impact on marginalized groups. ProPublica’s story on The Princeton Review, called “The Tiger-Mom Tax,” found that Asian families were almost twice as likely to be quoted the highest of three possible prices for an SAT tutoring course, and that income alone didn’t account for the pricing scheme. A team of journalism students at the University of Maryland, meanwhile, found that Uber wait times were longer in non-white areas in DC.

Bias is also one of the biggest concerns with predictive policing software like PredPol, which helps police allocate resources by identifying patterns in past crime data and predicting where a crime is likely to happen. The major question, says Maurice Chammah, a journalist at The Marshall Project who reported on predictive policing, is whether it will just lead to more policing for minorities. “There was a worry that if you just took the data on arrests and put it into an algorithm,” he says, “the algorithm would keep sending you back to minority communities.”

How Gmail lets spammers grab your attention with emoji ← Terence Eden’s Blog

So, what’s going on here? How have they got an animated image into the subject line?

Here’s the raw text of the message’s subject line:

Let’s take a look at the code sequence at the start and end of the subject: =?UTF-8?B?876tqQ==?=

As all good geeks know, characters outside the ASCII range in email headers are wrapped in “encoded-words”, in this case using Base64.

The resultant character is U+FEB69 – a “Private Use” character which has no defined representation in Unicode.
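You can check this with Python’s standard `email.header` module. The sketch assumes the complete encoded-word is `=?UTF-8?B?876tqQ==?=`, that is, the sequence quoted above with its full Base64 padding and closing `?=`.

```python
import unicodedata
from email.header import decode_header

# Assumed complete encoded-word: charset UTF-8, "B" (Base64) encoding.
encoded_word = "=?UTF-8?B?876tqQ==?="

# decode_header returns a list of (bytes, charset) parts; here there is one.
raw, charset = decode_header(encoded_word)[0]
char = raw.decode(charset)

print(raw.hex())                    # f3beada9
print(f"U+{ord(char):X}")           # U+FEB69
print(unicodedata.category(char))   # Co (private use)
```

The general category `Co` confirms the character sits in a Private Use Area, so what it looks like depends entirely on whoever renders it.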

For most of us, the character “󾭩” doesn’t display as any meaningful symbol – but on the web version of Gmail, it shows up as a flashing star.

WTF?

Ok, here’s what’s going on…

Way back in the mists of time (well, about 2009) there was no standard for Emoji. Each company made use of Unicode’s private use characters in a different way. If you had a phone from Google and sent a message using the “Glowing Star Emoji” to a phone made by another manufacturer – the symbol would either not display properly, or show up as a completely different character!

Obviously, in an interconnected world, such a situation is untenable – so Google and several other companies set up the Emoji4Unicode project.

Google uses Private Use mappings to represent Emoji (“picture character”) symbols in Unicode text. These characters are commonly used by Japanese cell phone carriers. This project makes these mappings available.

Google and other members of the Unicode consortium are also developing a proposal for the addition of standardized Emoji symbol characters to Unicode.

The Unicode consortium banged some heads together (in a friendly way) and everyone agreed on a new standardised set of characters.

The new Unicode standard has “Glowing Star” set as U+1F31F and looks like this: 🌟.
(If your computer doesn’t support Unicode 6.0 you can take a look at the official reference chart.)
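For comparison, here is how the standardised character looks from Python, next to the old private-use one. This is only a sketch; it assumes a Python build whose Unicode database includes version 6.0 or later, which is true of any current Python 3.

```python
import unicodedata

star = chr(0x1F31F)      # the standardised character
legacy = chr(0xFEB69)    # the old private-use code point

print(unicodedata.name(star))                  # GLOWING STAR
print(unicodedata.name(legacy, "<no name>"))   # <no name>

# Byte sequences you would see on the wire.
print(star.encode("utf-8").hex())       # f09f8c9f
print(star.encode("utf-16-be").hex())   # d83cdf1f (a surrogate pair)
```

The standard character carries a name that every implementation agrees on. The private-use one has none, which is why its meaning never travelled reliably between vendors.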

But the old version lives on! The animated GIF lives at https://mail.google.com/mail/e/B69 where it is used for the web version of Gmail. (You can alter that end number to get all manner of odd characters.)

Modern Android phones still recognise this relic – although, in Google’s typically slapdash fashion, Android’s Gmail app won’t display the animation in the subject line, only in the body.

The same happens with the iOS version of Gmail. Animated in the body, not in the subject line.

Try it yourself by sending an email with the subject and body “Star 🌟 vs Animated 󾭩”.

It doesn’t seem to work in Google Hangouts – or any other Google apps, just mail.

Interestingly, when sending these characters from the web or Android version of Gmail, it adds an “X-Goomoji-Subject” header and automatically converts the characters to GIFs. The Unicode is completely stripped away from the message.

So there we have it. An ancient form of Emoji, probably all but forgotten, has been resurrected by spammers in the hope that you’ll notice their wares.

What a load of 󾓴!

Unicode Emoji

Unicode Emoji Resources

Unicode Emoji Subcommittee

The Unicode Emoji Subcommittee is responsible for the following:

  • Updating, revising, and extending emoji documents such as UTR #51, Unicode Emoji and Unicode Emoji Charts.
  • Taking input from various sources and reviewing requests for new emoji characters.
  • Creating proposals for the Unicode Technical Committee regarding additional emoji characters and new emoji-related mechanisms.
  • Investigating longer-term mechanisms for supporting emoji as images (stickers).

The Unicode Emoji Subcommittee is a subcommittee of the Unicode Technical Committee operating under the Technical Committee Procedures. Current co-chairs are Mark Davis (Google) and Peter Edberg (Apple).

Participation in the Unicode Emoji Subcommittee weekly video/phone meetings and mailing list is open to members of the Unicode Consortium as listed in §13.1 of the Technical Committee Procedures, plus invited guests. Contact us for more information.

Investigating the algorithms that govern our lives – Columbia Journalism Review

TO GET STARTED:

  1. “How big data is unfair”: A layperson’s guide to why big data and algorithms are inherently biased.
  2. “Algorithmic accountability reporting: On the investigation of black boxes”: The primer on reporting on algorithms, by Nick Diakopoulos, an assistant professor at the University of Maryland who has written extensively on the intersection of journalism and algorithmic accountability. A must-read.
  3. “Certifying and removing disparate impact”: The computer scientist’s guide to locating and fixing bias in algorithms computationally, by Suresh Venkatasubramanian and colleagues. Some math is involved, but you can skip it.
  4. The Curious Journalist’s Guide to Data: Jonathan Stray’s gentle guide to thinking about data as communication, much of which applies to reporting on algorithms as well.

IBM System/360 at the IRS – 1966 -1967 Computer History Archives – YouTube

https://www.youtube.com/watch?v=JaRzExHoUl0

Vintage 1966 film excerpt from the IRS showing how an IBM System/360 mainframe system is used in their tax processing data center. About 4 mins long, color and narration.

Nice view of a 1960s-era data center and the System/360 master console. This excerpt focuses on the System/360.

The full version of this film is called “Right on the Button” and is also available on YouTube. (The model numbers of the equipment seem to have been taped over by the IRS film maker, but it is clearly the IBM 360, and its tape, disk and punch card peripherals.)

Learning machine learning — Benedict Evans

As has happened with many technologies before, AI is bursting out of universities and research labs and turning into product, often led by those researchers as they turn entrepreneur and create companies. Lots of things started working, the two most obvious illustrations being the progress for ImageNet and of course AlphaGo. And in parallel, many of these capabilities are being abstracted – they’re being turned into open source frameworks that people can pick up (almost) off the shelf. So, one could argue that AI is undergoing a take-off in practicality and scale that’s going to transform tech just as, in different ways, packets, mobile, or open source did.

This also means, though, that there’s a sort of tech Tourettes’ around – people shout ‘AI!’ or ‘MACHINE LEARNING!’ where people once shouted ‘OPEN!’ or ‘PACKETS!’. This stuff is changing the world, yes, but we need context and understanding. ‘AI’, really, is lots of different things, at lots of different stages. Have you built HAL 9000 or have you written a thousand IF statements?  

Back in 2000 and 2001 (and ever since) I spent a lot of my time reading PDFs about mobile – specifications and engineers’ conference presentations and technical papers – around all the layers of UMTS, WCDMA, J2ME, MEXE, WML, iAppli, cHTML, FeliCa, ISDB-T and many other things besides, some of which ended up mattering and some of which didn’t. (My long-dormant del.icio.us account has plenty of examples of both).

The same process will happen now with AI within a lot of the tech industry, and indeed all the broader industries that are affected by it. AI brings a blizzard of highly specialist terms and ideas, layered upon each other, that previously only really mattered to people in the field (mostly, in universities and research labs) and people who took a personal interest, and now, suddenly, this starts affecting everyone in technology. So, everyone who hasn’t been following AI for the last decade has to catch up.