PDA

View Full Version : Linear Regression

sean
06-26-2002, 06:48 PM
Here's my situation - I have a table, where x is always incremented by 1: 1, 2, 3, 4, etc... and then the y is a mystery - but I know it's consistent. I need to find the equation for the line that is created. I know that all the data points are EXACTLY on the line. I was told that Linear (or Least Squares) regression is a good solution, but so far, I've only found a method for determining how good the line of best fit is once you've found it. My line of best fit would be exact (in other words, r=1), and I need to know how to FIND THE LINE, not test it. Anyone have any ideas? I was told about an equation involving a whole lot x and ys squared with epsilons all over the place. Ring a bell?

Davros
06-26-2002, 07:25 PM
> I know that all the data points are EXACTLY on the line

If you know this, you don't need linear regression as only two points are needed to pin down the line. This is basic school maths.

i.e.
your line is described by: y = mx + c

you need to find values for m & c such that the line passes through points

x1, y1 & x2, y2 (your two chosen points)

two simultaneous equations to find m & c:

y1 = m*x1 + c
y2 = m* x2 + c

therefore

y1 - y2 = (m*x1) - (m*x2) = m(x1 - x2);

m = (y1 - y2) / (x1 - x2)

that gives you m. Just plug the numbers into one of the above equations to get the value for c.

i.e.

c = y1 - m*x1

c = y1 - (x1*(y1 - y2)/(x1 - x2))

SilentStrike
06-26-2002, 08:51 PM
The y=mx + b solution is insufficient. It requires that there are only two points, or that if there are more than two, they all are in the same line.

I'd say just go for the least squares regression. My roomate, an engineer, had to do such an assignment in fotran last semester. The equation is big and ugly, and I don't know how it was derived, but it works :).

Davros
06-27-2002, 02:30 AM
>The y=mx + b solution is insufficient. It requires that there are only two points, or that if there are more than two, they all are in the same line

True, two points would be insufficient if all the points were did not lie on the line. However, allow me to repeat Sean's statement:

'I know that all the data points are EXACTLY on the line'

That said, I'm obviously not going to disagree that linear regression is better in a general (scattered points) case.

sean
06-27-2002, 07:39 AM
However, allow me to repeat Sean's statement
However, allow me to repeat SilentStrike's statement: y=mx+c IS insufficient - I forgot to mention it could be a curve as well. I did some research, and as it turns out, regression is not what I need. Here's the problem in it's simplest form: I have a list of numbers, it can be more than two. I could ask for a minimum of 50 if I wanted to, but Id like to keep it minimal. These points have a pattern to them, and they form a line/curve. It could be as simple as y = 2x, or it could be a cubic, or something worse. This list of numbers is y, and x is the element number from the array + 1. I've figured out one method to solve them - since the method (this is from the research last night) to find the pattern would be different for each type of curve (parabola, hyperbola, cubic, etc..), I wanted to find a pattern of predictabilty in the formulas, so that I could keep testing. I'd try for a line with 2 points, test it on the third. If it didn't work out, I'd try 3 points for a parabola and test it on the fourth, 4 for a hyperbola, test it on the fifth and so forth. Can anyone give me the formulas for the first two types of curve (I know the one for lines, so I mean parabola and hyperbola, etc...). Thanks.

Nick
06-27-2002, 08:51 AM
It's impossible. For any given set of points I can up with many different curves that fit those points. Just using polynomials I can come up with a curve that fits any set of points.

Shiro
06-27-2002, 11:44 AM
Besides Fourier synthesis which is used in calculus, there are also methods from statistics, which are, ofcourse, described at Mathworld:

http://mathworld.wolfram.com/LeastSquaresFitting.html

At the bottom of the page there are references to more methods.

It's impossible. For any given set of points I can up with many different curves that fit those points. Just using polynomials I can come up with a curve that fits any set of points.

It can be possible. According to the sampling-theorem you can get all the information of a continue-time signal into a discrete-time signal if the sampling frequency is twice the highest frequency in the continue-time signal. Which implies that for that discrete-time signal there is only one continue-time signal that fits.

sean
06-27-2002, 01:31 PM
Well theoretically, you can solve it with two points for a line, 3 for a parabola, four for a cubic, etc. If I can just find a pattern in the methods, I'd try each one in order, and then keep going until I matched one, as described above. Using finite differences, I can figure which family the line comes from. The number of levels it takes to get to a zero, is equal to the number corresponding to the family plus one. I just need to find that pattern and incorporate it into a loop. So does anyone know the method for finding parabolas and cubics, plus a quat-a-bola or what ever you call it?

SilentStrike
06-27-2002, 05:10 PM
4th degree is quartic.

Yeah, you are right.. if it's a straight line, mx + b works. For some reason I skipped over the part where he said it was going to be a line (or for some reason, interpretted a line as any function, not as linear).

But both Davros and Shiro seem to know more than I, so I'll shut up and let them handle it :).

Nick
06-27-2002, 09:11 PM
What I mean is for example
{(0, 0), (1, 1)}
and you can have
f(x) = x, f(x) = x^2, f(x) = x^3

To find one solution for f you can use finite differences to
determine the degree of the function. So let's take
(0, 0), (1, 1), (2, 8), (3, 27), (4, 64), (5, 125)
1, 7, 19, 37, 61
6, 12, 18, 24
6, 6, 6

Then if you notice that after doing the finite difference once
the series is quadratic, twice the series is linear, and a third
time constant you know that the points can be fit with a cubic equation and so you need to find a, b, c, d for
ax^3 + bx^2 + cx + d by using the point data.

sean
06-28-2002, 09:31 AM
Well supposedly, there's only one parabola that can go through 3 specific points, 1 quartic/cubic/whatever that can go through four, and so forth. If I can find the methods for finding which curves these are, and then find a pettern sufficient for being able to find infinite degrees (if you ran the program forever) of curves. I'd try two, and if it didn't work out on the third, I'd try the parabolic one, and just work my way up from there. I think I've got this whole thing figured out, except for this pattern of methods - I just need a predictable method of finding the curve. Theoretically, I'd have to have my program write it's own code and then compile it, run it, and delete the file, so that's why I was doing CGIASM a while ago - to learn something about making compilers - looks like it cam in handy.

golfinguy4
06-28-2002, 09:41 AM
You can get many different curves that go through the set of points. The only difference is that the more complicated ones will be of a higher degree. I think what you are asking for is the curve with the lowest degree.

sean
06-28-2002, 10:42 AM
Well if any of you watch Seinfeld, that episode where Jerry likes his girlfriends roommate, and George finally comes up with the long plan to switch? He runs in to Jerry's apartment, throws his jacket on the ground and yells, "I GOT IT!" Well picture me doing that because I just did. I was looking around on calculus tutorial sites, because it had a minute amount to do with the project and I figured it might help - and if it didn't - I would've learned calculus. yay. So anyway - I found in one of the chapters on parametric equations, a little side note about curve fitting, which drew my attention as you can imagine. I followed a couple of links and found myself at an applet which would have data entered, and using the linear regression model, found the line of best fit (it had options for different kinds of curves - but I won't need that if you read on). It could do this for the first four degrees of curves. Because my points are all exactly on the line, and I can rely on that, if I found the line of best fit, it would be that line (as long as I did it for the right type of curve. So here's what I thinking: If I use the model designed for lines on quadratic curves, the line won't quite fit, and the r value (a value calculated to dtermine how good the line is) will be somewhat lower than 1 (1 would indicate a perfect match - what I'm after, 0.01 would be WAY off). Once I have found the pattern that would allow me to create method after method for increasingly complex curves, I'll just loop through it until I have an r value equalling 1, then I know what kind of line it is, the equation, and I can return all kinds of useful information about the line, which would aid in my ultimate goal - decryption (which, if you look it up in the dictionary, is not actually a word!). The rest of it is just finding the pattern of methods, and then it's all basic program design and computer science from there. So basically - problem solved. Thanks for all the help guys!

sean
06-28-2002, 10:45 AM
I also forgot to mention (this is kind of crucial to the "plot" I guess of the story) something: The site had a link to the Java Source Code, which also meant I would have to go through all the decompilation if that's even possible in Java without breaking the law. Unfortunately, the server's down at the site it linked to, so I just have to wait. I' guess I'll go watch Seifeld while I do it....