Facebook Knows [Child] Porn When It Sees It

If you’ve ever uploaded child porn to Facebook, Google, or Dropbox, you might have noticed that you are now in jail.

Online service providers automatically detect and report child exploitation images in user files. Don’t worry, Dropbox employees aren’t actually looking through your personal photos. I mean, they probably are, but the porn detection is done by computer.

A computer built by humans, of course. I always imagined a team of engineers in a basement, scouring Tor for obscene images to feed into a convolutional neural net.

At a recent visit with Facebook’s Machine Learning team, I finally got to ask:
Tell me about child porn. How did you build the detection model? What is it like to be the guy labeling the training data? Do you have to report him to the Feds if he seems to be enjoying his job too much? How do you keep your employees from losing faith in humanity?

That’s not how they do it.

Those who watch Dateline know that child pornography is illegal to knowingly possess [1]. And when a computer in your control has child pornography on it, you knowingly possess it. There’s no “I only possessed it so that I could train computers to recognize it” defense in the statute.

So how do major tech companies write software to detect child porn without ever possessing any of it themselves?

The National Center for Missing and Exploited Children (NCMEC) is granted an exemption to maintain a database of known child pornography images [2]. Using Microsoft’s PhotoDNA technology [3], each image is converted to greyscale, resized, and subdivided into a grid. A histogram of intensity gradients is created for each cell, then hashed.

Online service providers can store hash values, because it’s impossible to reconstruct an image from hash values.

When a user uploads a new image to a service provider, the image is deconstructed, and the new hash values compared against the existing database. If an image shares enough values with an existing hash set, it gets flagged.
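PhotoDNA itself is proprietary, so what follows is only a toy sketch of the general idea described above: grid the image, build a gradient-orientation histogram per cell, and compare the resulting numbers against known signatures. The function names, grid size, bin count, tolerance, and threshold are all my own assumptions, and the input is assumed to be already converted to greyscale and resized.

```python
import math

def signature(gray, grid=4, bins=4):
    """Toy perceptual signature (NOT the real PhotoDNA algorithm).

    gray: 2D list of 0-255 intensities, already greyscaled and resized.
    Returns grid*grid*bins small integers, one histogram per cell.
    """
    h, w = len(gray), len(gray[0])
    cell_h, cell_w = h // grid, w // grid
    sig = []
    for gy in range(grid):
        for gx in range(grid):
            hist = [0.0] * bins
            for y in range(gy * cell_h, (gy + 1) * cell_h):
                for x in range(gx * cell_w, (gx + 1) * cell_w):
                    # Central differences, clamped at the image border.
                    dx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
                    dy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
                    magnitude = math.hypot(dx, dy)
                    if magnitude == 0:
                        continue
                    # Unsigned edge orientation in [0, pi), bucketed into bins.
                    angle = math.atan2(dy, dx) % math.pi
                    b = min(int(angle / math.pi * bins), bins - 1)
                    hist[b] += magnitude
            total = sum(hist) or 1.0
            # Normalize so overall contrast changes matter less.
            sig.extend(round(255 * v / total) for v in hist)
    return sig

def shared_fraction(a, b, tolerance=8):
    """Fraction of signature values that agree within a tolerance (assumed metric)."""
    close = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return close / len(a)

def is_flagged(sig, known_signatures, threshold=0.9):
    """Flag an upload if it shares enough values with any known signature."""
    return any(shared_fraction(sig, k) >= threshold for k in known_signatures)

# Synthetic test patterns only: a horizontal ramp, a brightened copy, a vertical ramp.
horizontal = [[x * 7 for x in range(32)] for _ in range(32)]
brighter = [[v + 20 for v in row] for row in horizontal]
vertical = [[y * 7 for _ in range(32)] for y in range(32)]

known = [signature(horizontal)]
print(is_flagged(signature(brighter), known))
print(is_flagged(signature(vertical), known))
```

Because the signature is built from gradients rather than raw pixel values, the brightened copy still matches the registered ramp while the unrelated vertical ramp does not. That is the property that lets an altered copy of a known image get caught, and also why an image that was never registered has nothing to match against.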

While PhotoDNA can recognize images that have been cropped, resized, or altered, only previously-registered images can ever be detected. Never-before-seen child exploitation images would pass through Gmail undetected.

I’m not going to speculate on the mechanics behind building a child-porn detector for never-before-seen data (not publicly, anyway) except to say that it can be done. But probably not without violating 18 U.S.C. §2252.

Why can’t Facebook and Google mind their own business?

Possession of child porn is super illegal, and tech companies are responsible for the data on their servers. There’s no “I only possessed it because one of my idiot users uploaded it” defense. However, there is a safe harbor affirmative defense if the service provider reports the image to law enforcement “promptly and in good faith” [4].

Google can’t even mind its own business if you’re hosting images on your own personal web server, because Googlebot trawls the internet. Online service providers are bound by a duty to report [5]. If Google’s web crawler happens to crawl an exploitative image on your home server, it must dutifully report it.

Wait, what about my Privacy?

Google and Facebook and friends never promised you privacy. In fact, they explicitly promise you the opposite of privacy when you implicitly accept their Terms by using their service.

Moral of the story: Encrypt your stuff. Also, don’t knowingly possess child porn.

References:
1. 18 U.S.C. § 2252
2. 18 U.S.C. § 2258C
3. Microsoft’s PhotoDNA: Protecting children and businesses in the cloud
4. 18 U.S.C. § 2252A
5. 18 U.S.C. § 2258A

Will I Be Better Off With a Son or a Daughter?

My great-grandmother sold her youngest son in order to buy a bride for my grandfather.

I asked why my grandfather didn’t shop around, maybe try to find a cheaper wife. Apparently that’s not how human mating behavior works.

In most historical societies, a bride’s parents require payment because the woman marries into the husband’s family [1]. The offspring bear the paternal family name, till the paternal land, and defend the paternal tribe. Brides-to-be are valuable for their ability to spawn future utility.

Under normal conditions, the gender ratio is roughly 1:1. Oftentimes, such as during war, supply exceeds demand due to all the men who die fighting. Why don’t periods of excess supply drive bride prices down to zer0?

Blame polygamy. Human reproduction is an embarrassingly parallel problem. A male that wants ten offspring would get them quicker by fertilizing ten wives at once, instead of waiting 90 months for a single wife. In fact, she’d probably die by about the sixth or seventh.

Thus emerged the practice of wealthy men taking multiple wives. Bride price levels are maintained because many women would rather be an emperor’s umpteenth concubine than the monogamous partner of a peasant.

The sale and transfer of women leads to a tendency to treat wives like property. To mitigate this condition, families marry their daughters off with a dowry. This is a collection of assets that belong to the bride, which she can later sell and use for protection if her husband turns out to be a shit.

To address the initial question:

Poor families are better off bearing daughters, whom they can sell as brides.

Wealthy families should have a boy, and buy him lots of brides so that he can engender an army with which to defend the family wealth.

If you don’t live in an archaic transactional-marriage society, you should love all your offspring equally.

Interesting Notes:

Evolutionary biologists have theorized that sex ratios are influenced by economic conditions, where female offspring are more likely when resources are scarce [2]. Evidence of this claim in human populations has been statistically questionable.

Tibet is one of the few places I know of where women have multiple husbands. This is possibly due to land laws that prohibit property fragmentation.

References:
1. Schlegel, A. and Eloul, R. (1988), Marriage Transactions: Labor, Property, Status. American Anthropologist, 90: 291–309. doi:10.1525/aa.1988.90.2.02a00030
2. Trivers, R. and Willard, D. (1973), Natural Selection of Parental Ability to Vary the Sex Ratio of Offspring. Science, 179(4068): 90–92.

Never Trust a Poll

My body is a septic tank and it is my God-given right to treat it as such.

Civil unrest is brewing at the office. Our daily lunch-catering service is under contention.

Silicon Valley Office Politics: The fights are so vicious because the stakes are so low.

On Mondays, Wednesdays, and Fridays, we order from Eat Club, a lunch box service that delivers from local restaurants. The other two days we use Farm Hill, a farm-to-table service that provides locally-sourced boxes of foliage.

Some people appreciate the intestinal regularity provided by artisanal roughage. The rest of us recognize that our lives suck and want to dull the pain with highly processed carbs.

My boss decided that the only way to settle the lunch-catering question would be to conduct an office-wide poll.

My Slack poll was quickly met with complaints. But its replacement was no better!

Sure, the question seems unbiased. But how does one interpret the results?

Zer0 clearly won the race. However, the other half of the office could have easily colluded to throw the results. In fact, had the individuals voting for 5 or 2 known the eventual outcome, they would have sacrificed their preferences to side with 1.

Is the solution, then, to weight the votes and find the average? That would give us 1.13, though the individuals who chose 5 surely voted higher than their true preference in order to pull the average away from the zer0s. Were the zer0-voters doing the same, but opposite?

And so the poll became a twisted two-thirds-average consensus game. When you try to be fair to everyone, you end up pleasing no one.

The likely outcome is that we’ll do nothing. Maybe that was the intention of those behind the stalemate all along.

Next Week’s Hot-Button issue: Organic fair-trade coffee providers. Blue Bottle Coffee service, or Green Mountain Keurig cups?

If you live in the Bay Area and need to feed a Flemish rabbit, click the link above for a free Farm Hill lunch.

How Many Powerball Tickets Should I Buy?

It’s not completely stupid. The Powerball jackpot currently stands at $1.5 billion, with a $930 million cash value. There are only 292 million possible combinations of numbers, so there exists an arbitrage opportunity where one could buy up every possible number combination and walk away with $346 million.
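For anyone who wants to check that arithmetic, here is a quick sketch. The post doesn't spell out the inputs, so the $2 ticket price and the 5-of-69 plus 1-of-26 game format are assumptions I'm supplying. The calculation also ignores taxes and the possibility of splitting the jackpot.

```python
from math import comb

# Assumptions not stated in the post: base ticket price and game format.
TICKET_PRICE = 2            # dollars per ticket
CASH_VALUE = 930_000_000    # lump-sum cash value quoted above

# Five white balls drawn from 69, plus one red Powerball drawn from 26.
combinations = comb(69, 5) * 26
cost_to_buy_all = combinations * TICKET_PRICE
profit = CASH_VALUE - cost_to_buy_all

print(f"{combinations:,} combinations")
print(f"${cost_to_buy_all:,} to buy every ticket")
print(f"${profit:,} left over")
```

That works out to 292,201,338 combinations, $584,402,676 in tickets, and $345,597,324 left over, which rounds to the $346 million quoted above.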

But just because tickets have a positive expected value doesn’t mean I should buy as many as I can afford. There comes a point at which the 100% chance of going broke outweighs the fractional chance of winning $1.5 billion.

The correct answer is to identify the point where the expected value of my gains approximately equals the negative utility of my loss.

Disutility

My unhappiness isn’t a linearly increasing function with each dollar I lose. I certainly wouldn’t miss the first few. In fact, I might become happier spending a few dollars, because I’m buying hope.

Later, when I’m losing massive amounts of money, my disutility accelerates. It’s asymptotic, because beyond a certain point I’ll probably just jump off a bridge.

Calibration

At about $100, I start feeling pain. At $1000, I’m really quite bummed. At $80,000, I’m ready to kill myself. Not because I’m broke, but rather for the greater good of the gene pool because I just spent a year’s salary on freaking lottery tickets.

\mathbf{D} = 50000\cdot \tanh \left(\frac{x}{20000}-5\right)+49995

Now, the expectation of winning the Powerball is a linear function as each ticket increases my chances. We can normalize this to the disutility function by assuming my maximum positive utility is equal to the maximum disutility.

I’m going to ignore proration risk here: It’s fine if I have to share the jackpot with one or more other winners. My happiness maxes out long before $1.5 billion. Even before $100 million. I’ve done a lot of drugs, I know where my limits are.

So we find the point where my disutility first exceeds the expected positive outcome:

I guess there should be an open circle where disutility crosses the y axis. The negative-value universe where I open my own lottery and go into the business of selling tickets is an exercise for another day.

400. I should buy 400 tickets.
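The search for that crossing point can be sketched in a few lines. This takes the tanh disutility curve from the Calibration section as given and treats the expected-utility line as a straight line through the origin with slope `gain_per_dollar`. That slope is a hypothetical parameter: the exact scaling behind the chart isn't stated here, so this shows the method rather than reproducing the 400-ticket figure.

```python
import math

def disutility(dollars_lost):
    """The tanh disutility curve D(x) from the Calibration section."""
    return 50000 * math.tanh(dollars_lost / 20000 - 5) + 49995

def first_crossing(gain_per_dollar, limit=200_000):
    """First whole-dollar spend where disutility exceeds the expected gain.

    gain_per_dollar is a hypothetical slope for the normalized
    expected-utility line; it is an assumption, not a value from the post.
    """
    for dollars in range(1, limit + 1):
        if disutility(dollars) > gain_per_dollar * dollars:
            return dollars
    return None

# Disutility at the three calibration points mentioned above.
for dollars in (100, 1000, 80_000):
    print(dollars, round(disutility(dollars), 2))

# Crossing point for an illustrative, made-up slope.
print(first_crossing(1e-4))
```

Note that the curve starts slightly below zero, which matches the claim that the first few dollars buy hope rather than pain. A steeper expected-utility line pushes the crossing to the right, so the answer is quite sensitive to how the two curves are normalized.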

Although if I wanted to maximize the positive delta between expected outcome and disutility, I should only buy one. In reality, I’m lazy so I’ll end up buying zer0.

A Machine Learning Model for Salary Estimation

Some time ago, I tried to scrape every Bay Area profile off LinkedIn until the site blocked my entire office network (Lesson learned: Use a proxy). This was Bad because we were (and still are!) hiring.

The goal was to collect enough data to create a set of classifiers that could estimate a person’s salary from their LinkedIn profile.

LinkedIn profiles were decomposed using Latent Semantic Indexing and mapped to salary estimates based on users’ current job titles. I scraped all the Bay Area salary information from GlassDoor.

Now when we encounter a new profile, we can perform a similarity query, find the nearest matching profiles, and return their salaries.
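A stripped-down sketch of that similarity query looks something like the following. To keep it dependency-free, it compares raw bag-of-words vectors by cosine similarity and skips the Latent Semantic Indexing step entirely, so it is a stand-in for the real pipeline rather than a copy of it. The example profiles and salaries are made-up placeholders, and averaging the neighbors' salaries is my own assumption about how to combine them.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (the real pipeline would project these with LSI)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def estimate_salary(profile_text, labeled_profiles, k=2):
    """Average the salaries of the k most similar labeled profiles."""
    query = vectorize(profile_text)
    ranked = sorted(
        labeled_profiles,
        key=lambda item: cosine(query, vectorize(item[0])),
        reverse=True,
    )
    nearest = ranked[:k]
    return sum(salary for _, salary in nearest) / len(nearest)

# Hypothetical placeholder data, not real scraped profiles or salaries.
EXAMPLE_PROFILES = [
    ("software engineer python machine learning", 150_000),
    ("senior software engineer distributed systems", 170_000),
    ("barista coffee customer service", 35_000),
]

print(estimate_salary("python software engineer", EXAMPLE_PROFILES, k=2))
```

With real data the vocabulary is huge and sparse, which is the reason for reducing profiles to a smaller set of latent topics before comparing them.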

Previously this was all done using Python libraries, which made it too slow for public consumption. I finally got around to rewriting it all using Google’s TensorFlow libraries. The only remaining speed bump is the roundabout way I pull a user’s LinkedIn profile.

Here it is, go play with it.

I’ll write more about TensorFlow some other day, but for now I need to spend less time on this and more time on stuff that won’t get me fired.

Many thanks to Aronima, TingTing, and Wenjie. GlassBowl would not have happened without them.